


(19) Europäisches Patentamt
European Patent Office
Office européen des brevets

(11) EP 0 817 073 B1

(12) EUROPEAN PATENT SPECIFICATION

(45) Date of publication and mention of the grant of the patent: 07.05.2003 Bulletin 2003/19

(51) Int. Cl.7: G06F 12/08

(21) Application number: 97304648.5

(22) Date of filing: 27.06.1997




(54) A multiprocessing system configured to perform efficient write operations 

Multiprozessorsystem ausgestaltet zur effizienten Ausführung von Schreiboperationen
Système multiprocesseur capable d'exécuter efficacement des opérations d'écriture



(84) Designated Contracting States: 
DE FR GB IT NL SE 

(30) Priority: 01.07.1996 US 675634 

(43) Date of publication of application: 
07.01.1998 Bulletin 1998/02 

(73) Proprietor: Sun Microsystems, Inc. 
Santa Clara, California 95054 (US) 

(72) Inventor: Hagersten, Erik E. 

Palo Alto, California 94043 (US) 

(74) Representative: Harris, Ian Richard et al 
D. Young & Co., 

21 New Fetter Lane 
London EC4A 1DA (GB)






(56) References cited:
US-A- 5 091 846

• "METHOD OF STORING INTO NON-EX LINES WITH PROCESSOR PRIORITY MULTIPROCESSOR CACHES" IBM TECHNICAL DISCLOSURE BULLETIN, vol. 33, no. 11, 1 April 1991, pages 313-316, XP000110412

• TODD MOWRY ET AL: "TOLERATING LATENCY THROUGH SOFTWARE-CONTROLLED PREFETCHING IN SHARED-MEMORY MULTIPROCESSORS" JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, vol. 12, no. 2, 1 June 1991, pages 87-106, XP000227966

• MORI S-I ET AL: "A DISTRIBUTED SHARED MEMORY MULTIPROCESSOR: ASURA - MEMORY AND CACHE ARCHITECTURES" PROCEEDINGS OF THE SUPERCOMPUTING CONFERENCE, PORTLAND, NOV. 15 - 19, 1993, no. -, 15 November 1993, INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, pages 740-749, XP000437411



Note: Within nine months from the publication of the mention of the grant of the European patent, any person may give 
notice to the European Patent Office of opposition to the European patent granted. Notice of opposition shall be filed in 
a written reasoned statement. It shall not be deemed to have been filed until the opposition fee has been paid. (Art. 
99(1) European Patent Convention). 



Printed by Jouve, 75001 PARIS (FR) 






Description 

[0001] This invention relates to the field of multiprocessor computer systems and, more particularly, to performance of write operations in multiprocessor computer systems.

[0002] Multiprocessing computer systems include two or more processors which may be employed to perform computing tasks. A particular computing task may be performed upon one processor while other processors perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among multiple processors to decrease the time required to perform the computing task as a whole. Generally speaking, a processor is a device configured to perform an operation upon one or more operands to produce a result. The operation is performed in response to an instruction executed by the processor.

[0003] A popular architecture in commercial multiprocessing computer systems is the symmetric multiprocessor (SMP) architecture. Typically, an SMP computer system comprises multiple processors connected through a cache hierarchy to a shared bus. Additionally connected to the bus is a memory, which is shared among the processors in the system. Access to any particular memory location within the memory occurs in a similar amount of time as access to any other particular memory location. Since each location in the memory may be accessed in a uniform manner, this structure is often referred to as a uniform memory architecture (UMA).

[0004] Processors are often configured with internal caches, and one or more caches are typically included in the cache hierarchy between the processors and the shared bus in an SMP computer system. Multiple copies of data residing at a particular main memory address may be stored in these caches. In order to maintain the shared memory model, in which a particular address stores exactly one data value at any given time, shared bus computer systems employ cache coherency. Generally speaking, an operation is coherent if the effects of the operation upon data stored at a particular memory address are reflected in each copy of the data within the cache hierarchy. For example, when data stored at a particular memory address is updated, the update may be supplied to the caches which are storing copies of the previous data. Alternatively, the copies of the previous data may be invalidated in the caches such that a subsequent access to the particular memory address causes the updated copy to be transferred from main memory. For shared bus systems, a snoop bus protocol is typically employed. Each coherent transaction performed upon the shared bus is examined (or "snooped") against data in the caches. If a copy of the affected data is found, the state of the cache line containing the data may be updated in response to the coherent transaction.
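
As an illustration of the invalidation alternative described above, the following is a minimal sketch of write-invalidate snooping. The class names `SnoopingCache` and `SharedBus`, the dictionary-based cache, and the write-through update of main memory are assumptions made for brevity; they are not taken from the specification.

```python
# Illustrative write-invalidate snooping model; all names are hypothetical.
class SnoopingCache:
    def __init__(self):
        self.lines = {}  # address -> cached value (presence means "valid")

    def snoop_write(self, address):
        # Another agent wrote this address: drop the stale copy, if any.
        self.lines.pop(address, None)


class SharedBus:
    def __init__(self, caches):
        self.memory = {}
        self.caches = caches

    def write(self, writer, address, value):
        # Every other cache snoops the coherent transaction.
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(address)
        writer.lines[address] = value
        self.memory[address] = value  # write-through, assumed for simplicity

    def read(self, reader, address):
        # A miss transfers the current copy from main memory.
        if address not in reader.lines:
            reader.lines[address] = self.memory.get(address)
        return reader.lines[address]
```

In this sketch a second cache that held the old value loses its copy when the first cache writes, and its next read fetches the updated value from memory.
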
[0005] Unfortunately, shared bus architectures suffer from several drawbacks which limit their usefulness in multiprocessing computer systems. A bus is capable of a peak bandwidth (e.g. a number of bytes/second which may be transferred across the bus). As additional processors are attached to the bus, the bandwidth required to supply the processors with data and instructions may exceed the peak bus bandwidth. Since some processors are forced to wait for available bus bandwidth, performance of the computer system suffers when the bandwidth requirements of the processors exceed available bus bandwidth.

[0006] Additionally, adding more processors to a shared bus increases the capacitive loading on the bus and may even cause the physical length of the bus to be increased. The increased capacitive loading and extended bus length increase the delay in propagating a signal across the bus. Due to the increased propagation delay, transactions may take longer to perform. Therefore, the peak bandwidth of the bus may actually decrease as more processors are added.

[0007] These problems are further magnified by the continued increase in operating frequency and performance of processors. The increased performance enabled by the higher frequencies and more advanced processor microarchitectures results in higher bandwidth requirements than previous processor generations, even for the same number of processors. Therefore, buses which previously provided sufficient bandwidth for a multiprocessing computer system may be insufficient for a similar computer system employing the higher performance processors.

[0008] Another structure for multiprocessing computer systems is a distributed shared memory architecture. A distributed shared memory architecture includes multiple nodes within which processors and memory reside. The multiple nodes communicate via a network coupled therebetween. When considered as a whole, the memory included within the multiple nodes forms the shared memory for the computer system. Typically, directories are used to identify which nodes have cached copies of data corresponding to a particular address. Coherency activities may be generated via examination of the directories.

[0009] Distributed shared memory systems are scalable, overcoming the limitations of the shared bus architecture. Since many of the processor accesses are completed within a node, nodes typically have much lower bandwidth requirements upon the network than a shared bus architecture must provide upon its shared bus. The nodes may operate at high clock frequency and bandwidth, accessing the network when needed. Additional nodes may be added to the network without affecting the local bandwidth of the nodes. Instead, only the network bandwidth is affected.

[0010] Unfortunately, processor access to memory stored in a remote node (i.e. a node other than the node containing the processor) is significantly slower than access to memory within the node. In particular, write operations may suffer from severe performance degradation in a distributed shared memory system. If a write operation is performed by a processor in a particular node and the particular node does not have write permission to the coherency unit affected by the write operation, then the write operation is typically stalled until write permission is acquired from the remainder of the system. Stalling the write may occupy processor resources (such as storage locations for the write data) until the write permission is acquired. Accordingly, the processor resources are not available for use by subsequent operations, thus possibly further stalling processor execution. A more efficient method for performing write operations in a distributed shared memory system is desired.

[0011] IBM Technical Disclosure Bulletin, volume 33, number 11, 1 April 1991, pages 313-316, is directed to "method of storing into non-ex lines with processor priority multiprocessor caches". This IBM disclosure describes a technique whereby multiprocessor (MP) caches are provided a method of storing into non-ex lines with processor priority. The system operates with a token S for the whole system. At any moment of time, the token can be held by at most one central processor (CP). The CP holding the token is allowed to do stores unconditionally. Thus, the IBM disclosure teaches that ownership of a token is a grant of permission to do stores unconditionally.

[0012] An article from the Journal of Parallel and Distributed Computing, volume 12, number 2, 1 June 1991 at pages 87-106 by T Mowry et al, entitled "Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors" describes a method of prefetching data using a write to a location in I/O address space. The article refers to the so-called DASH architecture that implements prefetching using a release consistency model. The DASH system uses regular commercial processors, with no special prefetch instructions available. Consequently, to implement prefetching, the application's cacheable address space is double mapped into a portion of the I/O address space. To prefetch a location, a write is done to the corresponding location in this special I/O address space. The advantage of using I/O writes is that they get put into the write buffer and do not block the processor. Once the prefetch reaches the head of the write buffer, it is issued onto the bus. If the prefetch is for a remote memory location, it is converted into a regular memory-request message by the directory controller and sent to the home cluster. The prefetch response is stored in the remote access cache (RAC), a special 256 Kbyte cache associated with each cluster (see Figure 1), which then returns. When the processor subsequently reads the prefetched location, the data is supplied by the RAC. If the data has not arrived back at the RAC when the regular request reaches it, the RAC is intelligent enough not to issue a duplicate request to the home cluster; the processor request is satisfied as soon as the reply to the original prefetch request arrives.



[0013] Particular and preferred aspects of the invention are set out in the accompanying independent and dependent claims.

[0014] The problems outlined above are in large part solved by an embodiment of a computer system in accordance with the claimed invention. The computer system defines a "fast write" protocol for performing certain write operations. Write operations include a particular encoding if they are to be performed using the fast write protocol. When the system interface within a node detects the particular encoding, the write operation is captured by the system interface. In addition, the data is transferred to the system interface from the processor performing the write operation. The data transfer is performed even if the node is not maintaining a coherency state for the affected coherency unit which is consistent with performing the write operation. Instead, the coherency activity employed to acquire the proper coherency state is initiated subsequent to or in parallel with the receipt of data from the processor. Advantageously, processor resources are free to continue with other computing tasks while the system interface performs coherency activity in response to the write operation. Particularly when a processor performs a large number of write operations in succession, performing the write operations using the fast write protocol may increase performance of the computer system. The write operations may be quickly transferred into the system interface instead of being stalled within the processor awaiting resources occupied by previous write operations.

[0015] Fast write operations are performed prior to acquiring write permission to the coherency unit. Ordering with respect to other operations referencing the coherency unit is not maintained. Therefore, the fast write protocol is not suitable for all write operations within the computer system. However, the protocol may be used to increase performance. For example, a group of writes enveloped by software synchronization operations appear to be ordered as a group with respect to operations outside of the synchronization. The performance gained by executing the group of writes using the fast write protocol may outweigh the system bandwidth used to perform synchronization.

[0016] Generally, a write operation is executed by a processor within a local processing node and a coherency operation to at least one remote processing node is performed in response to the write operation. If the write operation is coded as a fast write, the write operation is completed within the local processing node prior to ordering of the coherency operation globally. Conversely, if the write operation is not coded as a fast write, then the write operation is completed within the local node subsequent to ordering of the coherency operation globally.
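
The difference between the two completion orders may be illustrated with a rough cycle count. The sketch below is not the claimed implementation: the function name `processor_busy_cycles`, the latency figures, the limit of one outstanding write, and the assumption that the system interface can buffer every fast write are all hypothetical simplifications.

```python
def processor_busy_cycles(num_writes, coherency_latency, fast_write, transfer_cycles=1):
    """Cycles until the processor has handed off all of its writes.

    Assumes one outstanding write at a time and an interface able to buffer
    every fast write; both are simplifications made for illustration only.
    """
    cycles = 0
    for _ in range(num_writes):
        if not fast_write:
            # Ordinary write: stall until the coherency operation is ordered.
            cycles += coherency_latency
        # Move the write data from the processor to the system interface.
        cycles += transfer_cycles
    return cycles
```

With an assumed coherency latency of 100 cycles, eight ordinary writes occupy the processor for 808 cycles in this model, whereas eight fast writes occupy it for 8 cycles; the coherency work still takes place, but within the system interface rather than the processor.
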

[0017] Broadly speaking, an embodiment of the present invention provides a method for performing write operations in a multiprocessing computer system. A write operation is executed by a processor within a local processing node of the multiprocessing computer system. A coherency operation to at least one remote processing node is performed in response to the write operation. If the write operation includes a specific predefined encoding, the write operation is completed within the local processing node prior to completion of the coherency operation. Alternatively, if the write operation includes an encoding different than the specific predefined encoding, the write operation is completed within the local processing node subsequent to completion of the coherency operation.

[0018] An embodiment of the present invention further provides an apparatus for performing write operations in a multiprocessing computer system comprising a processor and a system interface. The processor is configured to perform a write operation. Coupled to receive the write operation and to perform a coherency operation in response to the write operation, the system interface is configured to complete the write operation with respect to the processor prior to completing the coherency operation if the write operation includes a specific predefined encoding. The system interface is further configured to inhibit completion of the write operation with respect to the processor until completion of the coherency operation if the write operation includes a different encoding than the specific predefined encoding.

[0019] An embodiment of the invention also provides a computer system comprising a first processing node and a second processing node. The first processing node includes at least one processor configured to perform a write operation. Additionally, the first processing node is configured to complete the write operation with respect to the processor prior to the first processing node acquiring a coherency state allowing the write operation if the write operation includes a predefined encoding. The second processing node is configured as a home node of a coherency unit affected by the write operation. The second processing node is coupled to receive a coherency request from the first processing node which conveys the coherency request in order to acquire the appropriate coherency state.

[0020] Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings in which:

Fig. 1 is a block diagram of a multiprocessor computer system.

Fig. 1A is a conceptualized block diagram depicting a non-uniform memory architecture supported by one embodiment of the computer system shown in Fig. 1.

Fig. 1B is a conceptualized block diagram depicting a cache-only memory architecture supported by one embodiment of the computer system shown in Fig. 1.

Fig. 2 is a block diagram of one embodiment of a symmetric multiprocessing node depicted in Fig. 1.

Fig. 2A is an exemplary directory entry stored in one embodiment of a directory depicted in Fig. 2.

Fig. 3 is a block diagram of one embodiment of a system interface shown in Fig. 1.

Fig. 4 is a diagram depicting activities performed in response to a typical coherency operation between a request agent, a home agent, and a slave agent.

Fig. 5 is an exemplary coherency operation performed in response to a read to own request from a processor.

Fig. 6 is a flowchart depicting an exemplary state machine for one embodiment of a request agent shown in Fig. 3.

Fig. 7 is a flowchart depicting an exemplary state machine for one embodiment of a home agent shown in Fig. 3.

Fig. 8 is a flowchart depicting an exemplary state machine for one embodiment of a slave agent shown in Fig. 3.

Fig. 9 is a table listing request types according to one embodiment of the system interface.

Fig. 10 is a table listing demand types according to one embodiment of the system interface.

Fig. 11 is a table listing reply types according to one embodiment of the system interface.

Fig. 12 is a table listing completion types according to one embodiment of the system interface.

Fig. 13 is a table describing coherency operations in response to various operations performed by a processor, according to one embodiment of the system interface.

Fig. 14 is a diagram depicting a local physical address space including aliases.

Fig. 15 is a flow chart depicting steps executed by a system interface within the computer system shown in Fig. 1 to perform a write operation according to one embodiment.

Fig. 16 is a block diagram of a portion of one embodiment of an SMP node shown in Fig. 1, depicting performance of a write operation.

Fig. 17 is a diagram depicting coherency activities performed by one embodiment of the computer system shown in Fig. 1 in response to a write operation.

Fig. 18 is a timing diagram depicting a write stream operation.

Fig. 19 is a timing diagram depicting a fast write stream operation.

[0021] Turning now to Fig. 1, a block diagram of one embodiment of a multiprocessing computer system 10 is shown. Computer system 10 includes multiple SMP nodes 12A-12D interconnected by a point-to-point network 14. Elements referred to herein with a particular reference number followed by a letter will be collectively referred to by the reference number alone. For example, SMP nodes 12A-12D will be collectively referred to as SMP nodes 12. In the embodiment shown, each SMP node 12 includes multiple processors, external caches, an SMP bus, a memory, and a system interface. For example, SMP node 12A is configured with multiple processors including processors 16A-16B. The processors 16 are connected to external caches 18, which are further coupled to an SMP bus 20. Additionally, a memory 22 and a system interface 24 are coupled to SMP bus 20. Still further, one or more input/output (I/O) interfaces 26 may be coupled to SMP bus 20. I/O interfaces 26 are used to interface to peripheral devices such as serial and parallel ports, disk drives, modems, printers, etc. Other SMP nodes 12B-12D may be configured similarly.

[0022] Generally speaking, computer system 10 is optimized for performing write operations from a local SMP node 12 to a remote SMP node 12. A processor 16 within the local SMP node 12 performs a write operation having a specific encoding indicating that the write operation is to be performed using a "fast write" protocol. System interface 24, upon detection of the "fast write" write operation, stores the write operation and also allows transfer of the data corresponding to the write operation from the processor into the system interface. In this case, the data is transferred prior to performing coherency operations to acquire ownership of the coherency unit affected by the write operation (e.g. to acquire write permission to the coherency unit). Advantageously, processor 16 completes the write operation quickly. Resources internal to processor 16 are freed for use in subsequent operations. Performance of the computer system may be increased by freeing processor resources more rapidly than was previously achievable.
[0023] In one particular embodiment, certain of the most significant bits of the address presented by processor 16 upon SMP bus 20 indicate that the fast write protocol is to be used for a particular write operation. The remaining bits specify the destination node and the local physical address identifying a destination storage location within memory 22 of the destination node. Alternatively, the remaining bits may be a global address identifying a remote node which stores the affected coherency unit. Additionally, the fast write protocol is restricted to write stream operations in the particular embodiment. Write stream operations update an entire coherency unit. Therefore, the processor 16 performing the write stream operation need not obtain a copy of the coherency unit for updating. The fast write protocol additionally removes the ordering requirements for the write stream operations, allowing these operations to be removed from the processor 16 quickly. These write stream operations are ordered with respect to each other but not the other operations performed by the processor 16.
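
By way of example only, an address carrying the fast write indication in its most significant bits might be packed and unpacked as sketched below. The passage above gives no field widths, so the one-bit flag, the four-bit node identifier, the 36-bit local physical address, and the names `encode` and `decode` are assumptions chosen purely for illustration.

```python
# Hypothetical layout, chosen only for illustration (the text gives no widths):
# [ fast-write flag : 1 bit ][ node id : 4 bits ][ local physical address : 36 bits ]
LPA_BITS = 36
NODE_BITS = 4
FAST_WRITE_SHIFT = LPA_BITS + NODE_BITS


def encode(fast_write, node, lpa):
    """Pack the flag, destination node and local physical address."""
    if not 0 <= node < (1 << NODE_BITS):
        raise ValueError("node id out of range")
    if not 0 <= lpa < (1 << LPA_BITS):
        raise ValueError("local physical address out of range")
    return (int(fast_write) << FAST_WRITE_SHIFT) | (node << LPA_BITS) | lpa


def decode(address):
    """Recover (fast_write, node, lpa) from an address seen on the bus."""
    lpa = address & ((1 << LPA_BITS) - 1)
    node = (address >> LPA_BITS) & ((1 << NODE_BITS) - 1)
    fast_write = bool((address >> FAST_WRITE_SHIFT) & 1)
    return fast_write, node, lpa
```

A system interface following this assumed layout would test the flag bit of each write address and, when it is set, capture the write for the destination node named by the node field.
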

[0024] The fast write protocol may be useful for many purposes. Generally speaking, a write operation to be performed to a remote node and for which acquiring a local copy in the local node is not desired may be advantageously performed via the fast write protocol. For example, a write operation using a global address upon SMP bus 20 may be performed using the fast write protocol. As another example, a block copy of a local source block (e.g. a page) to a remote destination block may be performed. In order to perform the block copy operation, a processor 16 reads data from the local source block and writes the data to the remote destination block. The processor 16 may write the data to the remote destination block using the fast write protocol. Additionally, large interprocessor communications blocks (i.e. several coherency units) may be transferred using the fast write protocol. Smaller blocks may not utilize the fast write protocol because a synchronizing operation may be required between transmittal of the communications blocks and the setting of a flag indicating that the communications blocks are available for the receiving processor.

[0025] Generally speaking, a memory operation is an operation causing transfer of data from a source to a destination. The source and/or destination may be storage locations within the initiator, or may be storage locations within memory. When a source or destination is a storage location within memory, the source or destination is specified via an address conveyed with the memory operation. Memory operations may be read or write operations. A read operation causes transfer of data from a source outside of the initiator to a destination within the initiator. Conversely, a write operation causes transfer of data from a source within the initiator to a destination outside of the initiator. In the computer system shown in Fig. 1, a memory operation may include one or more transactions upon SMP bus 20 as well as one or more coherency operations upon network 14.

Architectural Overview 

[0026] Each SMP node 12 is essentially an SMP system having memory 22 as the shared memory. Processors 16 are high performance processors. In one embodiment, each processor 16 is a SPARC processor compliant with version 9 of the SPARC processor architecture. It is noted, however, that any processor architecture may be employed by processors 16.

[0027] Typically, processors 16 include internal instruction and data caches. Therefore, external caches 18 are labeled as L2 caches (for level 2, wherein the internal caches are level 1 caches). If processors 16 are not configured with internal caches, then external caches 18 are level 1 caches. It is noted that the "level" nomenclature is used to identify proximity of a particular cache to the processing core within processor 16. Level 1 is nearest the processing core, level 2 is next nearest, etc. External caches 18 provide rapid access to memory addresses frequently accessed by the processor 16 coupled thereto. It is noted that external caches 18 may be configured in any of a variety of specific cache arrangements. For example, set-associative or direct-mapped configurations may be employed by external caches 18.

[0028] SMP bus 20 accommodates communication between processors 16 (through caches 18), memory 22, system interface 24, and I/O interface 26. In one embodiment, SMP bus 20 includes an address bus and related control signals, as well as a data bus and related control signals. Because the address and data buses are separate, a split-transaction bus protocol may be employed upon SMP bus 20. Generally speaking, a split-transaction bus protocol is a protocol in which a transaction occurring upon the address bus may differ from a concurrent transaction occurring upon the data bus. Transactions involving address and data include an address phase in which the address and related control information is conveyed upon the address bus, and a data phase in which the data is conveyed upon the data bus. Additional address phases and/or data phases for other transactions may be initiated prior to the data phase corresponding to a particular address phase. An address phase and the corresponding data phase may be correlated in a number of ways. For example, data transactions may occur in the same order that the address transactions occur. Alternatively, address and data phases of a transaction may be identified via a unique tag.
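
The tag-based correlation of address and data phases mentioned above can be sketched as follows. The class `SplitTransactionBus`, its method names, and the use of a simple incrementing counter as the tag are hypothetical and serve only to show how a data phase arriving out of order may be matched to its address phase.

```python
class SplitTransactionBus:
    """Toy model of tag matching on a split-transaction bus (names hypothetical)."""

    def __init__(self):
        self.pending = {}   # tag -> address awaiting its data phase
        self.next_tag = 0

    def address_phase(self, address):
        # Record the address and hand back the tag that identifies it.
        tag = self.next_tag
        self.next_tag += 1
        self.pending[tag] = address
        return tag

    def data_phase(self, tag, data):
        # Data phases may arrive in any order; the tag recovers the address.
        address = self.pending.pop(tag)
        return address, data
```

In this sketch two address phases may be issued back to back and their data phases completed in the opposite order, with each data item still paired with the correct address.
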

[0029] Memory 22 is configured to store data and instruction code for use by processors 16. Memory 22 preferably comprises dynamic random access memory (DRAM), although any type of memory may be used. Memory 22, in conjunction with similar illustrated memories in the other SMP nodes 12, forms a distributed shared memory system. Each address in the address space of the distributed shared memory is assigned to a particular node, referred to as the home node of the address. A processor within a different node than the home node may access the data at an address of the home node, potentially caching the data. Therefore, coherency is maintained between SMP nodes 12 as well as among processors 16 and caches 18 within a particular SMP node 12A-12D. System interface 24 provides internode coherency, while snooping upon SMP bus 20 provides intranode coherency.
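
One simple way to assign each address to a home node is to give every node an equal, contiguous slice of the shared address space. The sketch below assumes such a mapping; the passage above does not state how the assignment is made, and the constants `NODE_MEMORY_BYTES` and `NUM_NODES` as well as the function names are hypothetical.

```python
NODE_MEMORY_BYTES = 1 << 30   # hypothetical: 1 GiB of shared memory per node
NUM_NODES = 4                 # four SMP nodes, as drawn in Fig. 1


def home_node(address):
    """Return the node assumed to own `address` under a contiguous mapping."""
    node = address // NODE_MEMORY_BYTES
    if not 0 <= node < NUM_NODES:
        raise ValueError("address outside the shared address space")
    return node


def is_remote(requesting_node, address):
    """True when the home node differs from the node issuing the access."""
    return home_node(address) != requesting_node
```

Under this assumed mapping, an access for which `is_remote` returns true is one that involves internode coherency, unless the data is already cached within the requesting node.
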
[0030] In addition to maintaining internode coherency, system interface 24 detects addresses upon SMP bus 20 which require a data transfer to or from another SMP node 12. System interface 24 performs the transfer, and provides the corresponding data for the transaction upon SMP bus 20. In the embodiment shown, system interface 24 is coupled to a point-to-point network 14. However, it is noted that in alternative embodiments other networks may be used. In a point-to-point network, individual connections exist between each node upon the network. A particular node communicates directly with a second node via a dedicated link. To communicate with a third node, the particular node utilizes a different link than the one used to communicate with the second node.

[0031] It is noted that, although four SMP nodes 12 
are shown in Fig. 1 , embodiments of computer system 

20 1 o employing any number of nodes are contemplated. 
[0032] Figs. 1A and 1B are conceptualized illustra- 
tions of distributed memory architectures supported by 
one embodiment of computer system 10. Specifically, 
Figs. 1 A and 1 B illustrate alternative ways in which each 

25 SMP node 12 of Fig. 1 may cache data and perform 
memory accesses. Details regarding the manner in 
which computer system 10 supports such accesses will 
be described in further detail below. 
[0033] Turning now to Fig. 1A, a logical diagram de- 

so picting a first memory architecture 30 supported by one 
embodiment of computer system 1 0 is shown. Architec- 
ture 30 includes multiple processors 32A-32D, multiple 
caches 34A-34D, multiple memories 36A-36D, and an 
interconnect network 38. The multiple memories 36 

35 form a distributed shared memory. Each address within 
the address space corresponds to a location within one 
of memories 36. 
[0034] Architecture 30 is a non-uniform memory architecture
(NUMA). In a NUMA architecture, the amount
of time required to access a first memory address may
be substantially different than the amount of time required
to access a second memory address. The access
time depends upon the origin of the access and the location
of the memory 36A-36D which stores the accessed
data. For example, if processor 32A accesses a
first memory address stored in memory 36A, the access
time may be significantly shorter than the access time
for an access to a second memory address stored in
one of memories 36B-36D. That is, an access by processor
32A to memory 36A may be completed locally (e.g.
without transfers upon network 38), while a processor
32A access to memory 36B is performed via network
38. Typically, an access through network 38 is slower
than an access completed within a local memory. For
example, a local access might be completed in a few
hundred nanoseconds while an access via the network
might occupy a few microseconds.
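
The locality dependence described in paragraph [0034] can be illustrated with a minimal sketch. The function name and the two latency constants are assumptions made for illustration; only their orders of magnitude ("a few hundred nanoseconds" and "a few microseconds") come from the text.

```python
# Hypothetical latency figures; only the orders of magnitude come from the text.
LOCAL_NS = 300     # "a few hundred nanoseconds"
REMOTE_NS = 3000   # "a few microseconds"


def numa_access_time_ns(requesting_node: int, home_node: int) -> int:
    """Return a modeled access time for one uncached memory access."""
    if requesting_node == home_node:
        return LOCAL_NS    # completed locally, no transfer upon the network
    return REMOTE_NS       # performed via the interconnect network
```

Under these assumed figures, an access by processor 32A to memory 36B would be modeled as ten times slower than an access to memory 36A.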
[0035] Data corresponding to addresses stored in remote
nodes may be cached in any of the caches 34.
However, once a cache 34 discards the data corresponding
to such a remote address, a subsequent access
to the remote address is completed via a transfer
upon network 38.
[0036] NUMA architectures may provide excellent 
performance characteristics for software applications 
which use addresses that correspond primarily to a par- 
ticular local memory. Software applications which exhib- 
it more random access patterns and which do not con- 
fine their memory accesses to addresses within a par- 
ticular local memory, on the other hand, may experience 
a large amount of network traffic as a particular proces- 
sor 32 performs repeated accesses to remote nodes. 
[0037] Turning now to Fig. 1B, a logic diagram depicting
a second memory architecture 40 supported by the
computer system 10 of Fig. 1 is shown. Architecture 40 
includes multiple processors 42A-42D, multiple caches 
44A-44D, multiple memories 46A-46D, and network 48. 
However, memories 46 are logically coupled between 
caches 44 and network 48. Memories 46 serve as larger 
caches (e.g. a level 3 cache), storing addresses which 
are accessed by the corresponding processors 42. 
Memories 46 are said to "attract" the data being operated
upon by a corresponding processor 42. As opposed to
the NUMA architecture shown in Fig. 1A, architecture 40
reduces the number of accesses upon the
network 48 by storing remote data in the local memory 
when the local processor accesses that data. 
[0038] Architecture 40 is referred to as a cache-only 
memory architecture (COMA). Multiple locations within 
the distributed shared memory formed by the combina- 
tion of memories 46 may store data corresponding to a 
particular address. No permanent mapping of a partic- 
ular address to a particular storage location is assigned. 
Instead, the location storing data corresponding to the 
particular address changes dynamically based upon the 
processors 42 which access that particular address. 
Conversely, in the NUMA architecture a particular stor- 
age location within memories 46 is assigned to a partic- 
ular address. Architecture 40 adjusts to the memory ac- 
cess patterns performed by applications executing ther- 
eon, and coherency is maintained between the memo- 
ries 46. 

[0039] In a preferred embodiment, computer system 
10 supports both of the memory architectures shown in 
Figs. 1A and 1B. In particular, a memory address may
be accessed in a NUMA fashion from one SMP node
12A-12D while being accessed in a COMA manner from
another SMP node 12A-12D. In one embodiment, a NUMA
access is detected if certain bits of the address upon
SMP bus 20 identify another SMP node 12 as the home 
node of the address presented. Otherwise, a COMA ac- 
cess is presumed. Additional details will be provided be- 
low. 
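
The detection rule of paragraph [0039] may be sketched as follows. This is a minimal illustration only: the 32-bit address width, the placement of the node bits at the top of the address, and both function names are assumptions rather than details of the described embodiment.

```python
NODE_BITS = 2       # enough to identify one of four SMP nodes
ADDRESS_BITS = 32   # hypothetical address width, chosen for illustration


def home_node_of(address: int) -> int:
    """Extract the home node from the most significant address bits."""
    return (address >> (ADDRESS_BITS - NODE_BITS)) & ((1 << NODE_BITS) - 1)


def access_mode(address: int, local_node: int) -> str:
    """NUMA if the address names another node as home, otherwise COMA."""
    return "NUMA" if home_node_of(address) != local_node else "COMA"
```

With these assumptions, address 0x40000000 names node 1 as its home, so it is treated as a NUMA access when presented in node 0 and as a COMA access when presented in node 1.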

[0040] In one embodiment, the COMA architecture is 
implemented using a combination of hardware and software
techniques. Hardware maintains coherency between
the locally cached copies of pages, and software
(e.g. the operating system employed in computer sys- 
tem 10) is responsible for allocating and deallocating 
cached pages. 

[0041] Fig. 2 depicts details of one implementation of
an SMP node 12A that generally conforms to the SMP
node 12A shown in Fig. 1. Other nodes 12 may be configured
similarly. It is noted that alternative specific
implementations of each SMP node 12 of Fig. 1 are also
possible. The implementation of SMP node 12A shown
in Fig. 2 includes multiple subnodes such as subnodes
50A and 50B. Each subnode 50 includes two processors
16 and corresponding caches 18, a memory portion
56, an address controller 52, and a data controller 54.
The memory portions 56 within subnodes 50 collectively
form the memory 22 of the SMP node 12A of Fig. 1.
Other subnodes (not shown) are further coupled to SMP bus
20 to form the I/O interfaces 26.
[0042] As shown in Fig. 2, SMP bus 20 includes an 
address bus 58 and a data bus 60. Address controller 
52 is coupled to address bus 58, and data controller 54 
is coupled to data bus 60. Fig. 2 also illustrates system 
interface 24, including a system interface logic block 62, 
a translation storage 64, a directory 66, and a memory 
tag (MTAG) 68. Logic block 62 is coupled to both ad- 
dress bus 58 and data bus 60, and asserts an ignore 
signal 70 upon address bus 58 under certain circum- 
stances as will be explained further below. Additionally, 
logic block 62 is coupled to translation storage 64, di- 
rectory 66, MTAG 68, and network 14. 
[0043] For the embodiment of Fig. 2, each subnode 
50 is configured upon a printed circuit board which may 
be inserted into a backplane upon which SMP bus 20 is 
situated. In this manner, the number of processors and/or
I/O interfaces 26 included within an SMP node 12 may
be varied by inserting or removing subnodes 50. For
example, computer system 10 may initially be configured
with a small number of subnodes 50. Additional
subnodes 50 may be added from time to time as the computing
power required by the users of computer system 10
grows.

[0044] Address controller 52 provides an interface be- 
tween caches 18 and the address portion of SMP bus 
20. In the embodiment shown, address controller 52 in- 
cludes an out queue 72 and some number of in queues 
74. Out queue 72 buffers transactions from the proces- 
sors connected thereto until address controller 52 is 
granted access to address bus 58. Address controller 
52 performs the transactions stored in out queue 72 in 
the order those transactions were placed into out queue 
72 (i.e. out queue 72 is a FIFO queue). Transactions 
performed by address controller 52 as well as transac- 
tions received from address bus 58 which are to be 
snooped by caches 18 and caches internal to processors
16 are placed into in queue 74.
[0045] Similar to out queue 72, in queue 74 is a FIFO
queue. All address transactions are stored in the in
queue 74 of each subnode 50 (even within the in queue
74 of the subnode 50 which initiates the address
transaction). Address transactions are thus presented to
caches 18 and processors 16 for snooping in the order
they occur upon address bus 58. The order that
transactions occur upon address bus 58 is the order for SMP
node 12A. However, the complete system is expected
to have one global memory order. This ordering
expectation creates a problem in both the NUMA and COMA
architectures employed by computer system 10, since
the global order may need to be established by the order
of operations upon network 14. If two nodes perform a
transaction to an address, the order that the corresponding
coherency operations occur at the home node for
the address defines the order of the two transactions as
seen within each node. For example, if two write
transactions are performed to the same address, then the
second write operation to arrive at the address' home
node should be the second write transaction to complete
(i.e. a byte location which is updated by both write
transactions stores a value provided by the second write
transaction upon completion of both transactions).
However, the node which performs the second transaction
may actually have the second transaction occur first
upon SMP bus 20. Ignore signal 70 allows the second
transaction to be transferred to system interface 24
without the remainder of the SMP node 12 reacting to the
transaction.

[0046] Therefore, in order to operate effectively with 
the ordering constraints imposed by the out queue/in 
queue structure of address controller 52, system inter- 
face logic block 62 employs ignore signal 70. When a 
transaction is presented upon address bus 58 and sys- 
tem interface logic block 62 detects that a remote trans- 
action is to be performed in response to the transaction, 
logic block 62 asserts the ignore signal 70. Assertion of 
the ignore signal 70 with respect to a transaction causes 
address controller 52 to inhibit storage of the transaction 
into in queues 74. Therefore, other transactions which 
may occur subsequent to the ignored transaction and 
which complete locally within SMP node 12A may complete
out of order with respect to the ignored transaction
without violating the ordering rules of in queue 74. In 
particular, transactions performed by system interface 
24 in response to coherency activity upon network 14 
may be performed and completed subsequent to the ig- 
nored transaction. When a response is received from 
the remote transaction, the ignored transaction may be
reissued by system interface logic block 62 upon ad- 
dress bus 58. The transaction is thereby placed into in 
queue 74, and may complete in order with transactions 
occurring at the time of reissue. 
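
The effect of ignore signal 70 on the contents of in queue 74 can be sketched with a toy model. The class and function names and the transaction labels "T1" and "T2" are hypothetical; the model covers only the queue insertion rule described in paragraph [0046], not the bus protocol itself.

```python
from collections import deque


class InQueueModel:
    """Toy model of in queue 74 for a single subnode (hypothetical names)."""

    def __init__(self) -> None:
        self.in_queue = deque()  # FIFO order in which transactions are snooped

    def present(self, transaction: str, ignore: bool) -> None:
        # An asserted ignore signal inhibits storage into the in queue.
        if not ignore:
            self.in_queue.append(transaction)


def snoop_order() -> list:
    model = InQueueModel()
    model.present("T1", ignore=True)    # T1 requires a remote transaction
    model.present("T2", ignore=False)   # T2 completes locally
    model.present("T1", ignore=False)   # T1 reissued after the remote response
    return list(model.in_queue)
```

In this sketch T2 is snooped before T1 even though T1 appeared first upon the address bus, which is the reordering the ignore signal is intended to permit.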
[0047] It is noted that in one embodiment, once a 
transaction from a particular address controller 52 has 
been ignored, subsequent coherent transactions from 
that particular address controller 52 are also ignored. 
Transactions from a particular processor 16 may have 
an important ordering relationship with respect to each 
other, independent of the ordering requirements imposed
by presentation upon address bus 58. For example,
a transaction may be separated from another transaction
by a memory synchronizing instruction such as
the MEMBAR instruction included in the SPARC architecture.
The processor 16 conveys the transactions in
the order the transactions are to be performed with respect
to each other. The transactions are ordered within
out queue 72, and therefore the transactions originating
from a particular out queue 72 are to be performed in
order. Ignoring subsequent transactions from a particular
address controller 52 allows the in-order rules for a
particular out queue 72 to be preserved. It is further noted
that not all transactions from a particular processor
must be ordered. However, it is difficult to determine upon
address bus 58 which transactions must be ordered
and which transactions need not be ordered. Therefore,
in this implementation, logic block 62 maintains the or- 
der of all transactions from a particular out queue 72. It 
is noted that other implementations of subnode 50 are 
possible that allow exceptions to this rule. 
[0048] Data controller 54 routes data to and from data 
bus 60, memory portion 56 and caches 18. Data con- 
troller 54 may include in and out queues similar to ad- 
dress controller 52. In one embodiment, data controller 
54 employs multiple physical units in a byte-sliced bus 
configuration. 

[0049] Processors 16 as shown in Fig. 2 include memory
management units (MMUs) 76A-76B. MMUs 76 perform
a virtual to physical address translation upon the
data addresses generated by the instruction code executed
upon processors 16, as well as the instruction addresses.
The addresses generated in response to instruction
execution are virtual addresses. In other
words, the virtual addresses are the addresses created 
by the programmer of the instruction code. The virtual 
addresses are passed through an address translation 
mechanism (embodied in MMUs 76), from which corre- 
sponding physical addresses are created. The physical 
address identifies a storage location within memory 22. 
[0050] Address translation is performed for many rea- 
sons. For example, the address translation mechanism 
may be used to grant or deny a particular computing 
task's access to certain memory addresses. In this man- 
ner, the data and instructions within one computing task 
are isolated from the data and instructions of another 
computing task. Additionally, portions of the data and 
instructions of a computing task may be "paged out" to 
a hard disk drive. When a portion is paged out, the trans- 
lation is invalidated. Upon access to the portion by the 
computing task, an interrupt occurs due to the failed 
translation. The interrupt allows the operating system to 
retrieve the corresponding information from the hard 
disk drive. In this manner, more virtual memory may be 
available than actual memory in memory 22. Many other 
uses for virtual memory are well known. 
[0051] Referring back to the computer system 10
shown in Fig. 1 in conjunction with the SMP node 12A
implementation illustrated in Fig. 2, the physical address
computed by MMUs 76 is a local physical address (LPA)
defining a location within the memory 22 associated with
the SMP node 12 in which the processor 16 is located.
MTAG 68 stores a coherency state for each "coherency
unit" in memory 22. When an address transaction is performed
upon SMP bus 20, system interface logic block
62 examines the coherency state stored in MTAG 68 for
the accessed coherency unit. If the coherency state indicates
that the SMP node 12 has sufficient access
rights to the coherency unit to perform the access, then
the address transaction proceeds. If, however, the coherency
state indicates that coherency activity should
be performed prior to completion of the transaction, then
system interface logic block 62 asserts the ignore signal
70. Logic block 62 performs coherency operations upon
network 14 to acquire the appropriate coherency state.
When the appropriate coherency state is acquired, logic
block 62 reissues the ignored transaction upon SMP bus
20. Subsequently, the transaction completes.
[0052] Generally speaking, the coherency state main- 
tained for a coherency unit at a particular storage loca- 
tion (e.g. a cache or a memory 22) indicates the access 
rights to the coherency unit at that SMP node 12. The 
access right indicates the validity of the coherency unit, 
as well as the read/write permission granted for the copy
of the coherency unit within that SMP node 12. In one 
embodiment, the coherency states employed by com- 
puter system 10 are modified, owned, shared, and 
invalid. The modified state indicates that the SMP node 
12 has updated the corresponding coherency unit. 
Therefore, other SMP nodes 12 do not have a copy of 
the coherency unit. Additionally, when the modified co- 
herency unit is discarded by the SMP node 12, the co- 
herency unit is stored back to the home node. The 
owned state indicates that the SMP node 12 is respon- 
sible for the coherency unit, but other SMP nodes 12 
may have shared copies. Again, when the coherency 
unit is discarded by the SMP node 12, the coherency 
unit is stored back to the home node. The shared state 
indicates that the SMP node 12 may read the coherency 
unit but may not update the coherency unit without ac- 
quiring the owned state. Additionally, other SMP nodes 
12 may have copies of the coherency unit as well. Fi- 
nally, the invalid state indicates that the SMP node 12 
does not have a copy of the coherency unit. In one em- 
bodiment, the modified state indicates write permission 
and any state but invalid indicates read permission to 
the corresponding coherency unit. 
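
The four coherency states and the permission rule stated at the end of paragraph [0052] may be summarized in a short sketch. The names State, may_read, may_write and must_ignore are illustrative assumptions; the last function models only the decision of whether coherency activity is required before a transaction can complete.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    SHARED = "S"
    INVALID = "I"


def may_read(state: State) -> bool:
    # Any state but invalid indicates read permission.
    return state is not State.INVALID


def may_write(state: State) -> bool:
    # Only the modified state indicates write permission.
    return state is State.MODIFIED


def must_ignore(state: State, is_write: bool) -> bool:
    """True if coherency activity is needed before the access can complete."""
    permitted = may_write(state) if is_write else may_read(state)
    return not permitted
```

For instance, a write to a coherency unit held in the shared state would require coherency activity under this rule, whereas a read of the same unit would proceed.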
[0053] As used herein, a coherency unit is a number 
of contiguous bytes of memory which are treated as a 
unit for coherency purposes. For example, if one byte 
within the coherency unit is updated, the entire coher- 
ency unit is considered to be updated. In one specific 
embodiment, the coherency unit is a cache line, com- 
prising 64 contiguous bytes. It is understood, however, 
that a coherency unit may comprise any number of 
bytes. 

[0054] System interface 24 also includes a translation
mechanism which utilizes translation storage 64 to store
translations from the local physical address to a global 
address (GA). Certain bits within the global address 
identify the home node for the address, at which coherency
information is stored for that global address. For
example, an embodiment of computer system 10 may 
employ four SMP nodes 12 such as that of Fig. 1. In 
such an embodiment, two bits of the global address 
identify the home node. Preferably, bits from the most 
significant portion of the global address are used to iden- 
tify the home node. The same bits are used in the local 
physical address to identify NUMA accesses. If the bits 
of the LPA indicate that the local node is not the home 
node, then the LPA is a global address and the transac- 
tion is performed in NUMA mode. Therefore, the oper- 
ating system places global addresses in MMUs 76 for 
any NUMA-type pages. Conversely, the operating sys- 
tem places LPAs in MMU 76 for any COMA-type pages. 
It is noted that an LPA may equal a GA (for NUMA ac- 
cesses as well as for global addresses whose home is 
within the memory 22 in the node in which the LPA is 
presented). Alternatively, an LPA may be translated to 
a GA when the LPA identifies storage locations used for 
storing copies of data having a home in another SMP 
node 12. 

[0055] The directory 66 of a particular home node 
identifies which SMP nodes 12 have copies of data cor- 
responding to a given global address assigned to the 
home node such that coherency between the copies 
may be maintained. Additionally, the directory 66 of the 
home node identifies the SMP node 12 which owns the 
coherency unit. Therefore, while local coherency be- 
tween caches 18 and processors 16 is maintained via 
snooping, system-wide (or global) coherency is main- 
tained using MTAG 68 and directory 66. Directory 66 
stores the coherency information corresponding to the 
coherency units which are assigned to SMP node 12A 
(i.e. for which SMP node 12A is the home node). 
[0056] It is noted that for the embodiment of Fig. 2, 
directory 66 and MTAG 68 store information for each 
coherency unit (i.e., on a coherency unit basis). Con- 
versely, translation storage 64 stores local physical to 
global address translations defined for pages. A page 
includes multiple coherency units, and is typically sev- 
eral kilobytes or even megabytes in size. 
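
The difference in granularity between translations (per page) and coherency information (per coherency unit) can be shown with a brief sketch. The 8192-byte page size, the dictionary used as a stand-in for translation storage 64, the identity mapping for pages without an entry, and both function names are assumptions made for illustration.

```python
PAGE_SIZE = 8192       # hypothetical page size ("several kilobytes")
COHERENCY_UNIT = 64    # cache line size of the specific embodiment


def lpa_to_ga(lpa: int, page_translations: dict) -> int:
    """Translate a local physical address to a global address per page.

    Pages with no entry are treated as identity-mapped (LPA equals GA).
    """
    page, offset = divmod(lpa, PAGE_SIZE)
    ga_page = page_translations.get(page, page)
    return ga_page * PAGE_SIZE + offset


def coherency_unit_index(ga: int) -> int:
    """Index of the coherency unit containing a given global address."""
    return ga // COHERENCY_UNIT
```

Under these assumed sizes one page spans 128 coherency units, so a single translation entry covers 128 separate directory and MTAG entries.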
[0057] Software accordingly creates local physical 
address to global address translations on a page basis 
(thereby allocating a local memory page for storing a 
copy of a remotely stored global page). Therefore, 
blocks of memory 22 are allocated to a particular global 
address on a page basis as well. However, as stated 
above, coherency states and coherency activities are 
performed upon a coherency unit. Therefore, when a 
page is allocated in memory to a particular global ad- 
dress, the data corresponding to the page is not neces- 
sarily transferred to the allocated memory. Instead, as 
processors 16 access various coherency units within 
the page, those coherency units are transferred from the
owner of the coherency unit. In this manner, the data
actually accessed by SMP node 12A is transferred into 
the corresponding memory 22. Data not accessed by 
SMP node 12A may not be transferred, thereby reduc- 
ing overall bandwidth usage upon network 14 in com- 
parison to embodiments which transfer the page of data 
upon allocation of the page in memory 22. 
[0058] It is noted that in one embodiment, translation 
storage 64, directory 66, and/or MTAG 68 may be cach- 
es which store only a portion of the associated transla- 
tion, directory, and MTAG information, respectively. The 
entirety of the translation, directory, and MTAG informa- 
tion is stored in tables within memory 22 or a dedicated 
memory storage (not shown). If required information for 
an access is not found in the corresponding cache, the 
tables are accessed by system interface 24. 
[0059] Turning now to Fig. 2A, an exemplary directory 
entry 71 is shown. Directory entry 71 may be employed 
by one embodiment of directory 66 shown in Fig. 2. Oth- 
er embodiments of directory 66 may employ dissimilar 
directory entries. Directory entry 71 includes a valid bit 
73, a write back bit 75, an owner field 77, and a sharers 
field 79. Directory entry 71 resides within the table of 
directory entries, and is located within the table via the 
global address identifying the corresponding coherency 
unit. More particularly, the directory entry 71 associated 
with a coherency unit is stored within the table of direc- 
tory entries at an offset formed from the global address 
which identifies the coherency unit. 
[0060] Valid bit 73 indicates, when set, that directory 
entry 71 is valid (i.e. that directory entry 71 is storing 
coherency information for a corresponding coherency 
unit). When clear, valid bit 73 indicates that directory en- 
try 71 is invalid. 

[0061] Owner field 77 identifies one of SMP nodes 12
as the owner of the coherency unit. The owning SMP
node 12A-12D maintains the coherency unit in either the
modified or owned states. Typically, the owning SMP
node 12A-12D acquires the coherency unit in the modified
state (see Fig. 13 below). Subsequently, the owning
SMP node 12A-12D may then transition to the owned 
state upon providing a copy of the coherency unit to an- 
other SMP node 12A-12D. The other SMP node 12A- 
12D acquires the coherency unit in the shared state. In 
one embodiment, owner field 77 comprises two bits en- 
coded to identify one of four SMP nodes 12A-12D as 
the owner of the coherency unit. 
[0062] Sharers field 79 includes one bit assigned to 
each SMP node 12A-12D. If an SMP node 12A-12D is 
maintaining a shared copy of the coherency unit, the 
corresponding bit within sharers field 79 is set. Con- 
versely, if the SMP node 12A-12D is not maintaining a 
shared copy of the coherency unit, the corresponding 
bit within sharers field 79 is clear. In this manner, sharers 
field 79 indicates all of the shared copies of the coherency
unit which exist within the computer system 10 of
Fig. 1.

[0063] Write back bit 75 indicates, when set, that the
SMP node 12A-12D identified as the owner of the coherency
unit via owner field 77 has written the updated
copy of the coherency unit to the home SMP node 12.
When clear, bit 75 indicates that the owning SMP node
12A-12D has not written the updated copy of the coherency
unit to the home SMP node 12A-12D.
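
The four fields of directory entry 71 can be represented compactly for a four-node system. The following is a sketch only: the field widths for the owner (two bits) and sharers (one bit per node) follow the text, but the ordering of fields within a byte and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    valid: bool        # valid bit 73
    write_back: bool   # write back bit 75
    owner: int         # owner field 77, two bits (node 0..3)
    sharers: int       # sharers field 79, one bit per node

    def pack(self) -> int:
        # Hypothetical layout: [valid | write_back | owner(2) | sharers(4)]
        return ((int(self.valid) << 7) | (int(self.write_back) << 6)
                | ((self.owner & 0b11) << 4) | (self.sharers & 0b1111))

    @classmethod
    def unpack(cls, byte: int) -> "DirectoryEntry":
        return cls(bool((byte >> 7) & 1), bool((byte >> 6) & 1),
                   (byte >> 4) & 0b11, byte & 0b1111)

    def sharing_nodes(self) -> list:
        # Nodes whose bit is set hold a shared copy of the coherency unit.
        return [n for n in range(4) if (self.sharers >> n) & 1]
```

With this assumed layout an entry for one coherency unit fits in a single byte, and the sharers mask can be scanned to find every node that must receive a coherency demand.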
[0064] Turning now to Fig. 3, a block diagram of one
embodiment of system interface 24 is shown. As shown
in Fig. 3, system interface 24 includes directory 66,
translation storage 64, and MTAG 68. Translation storage
64 is shown as a global address to local physical
address (GA2LPA) translation unit 80 and a local physical
address to global address (LPA2GA) translation unit
82.

[0065] System interface 24 also includes input and
output queues for storing transactions to be performed 
upon SMP bus 20 or network 14. Specifically, for the 
embodiment shown, system interface 24 includes input 
header queue 84 and output header queue 86 for buff- 
ering header packets to and from network 14. Header 
packets identify an operation to be performed, and spec- 
ify the number and format of any data packets which 
may follow. Output header queue 86 buffers header 
packets to be transmitted upon network 14, and input 
header queue 84 buffers header packets received from 
network 14 until system interface 24 processes the re- 
ceived header packets. Similarly, data packets are buff- 
ered in input data queue 88 and output data queue 90 
until the data may be transferred upon SMP data bus 
60 and network 14, respectively. 
[0066] SMP out queue 92, SMP in queue 94, and 
SMP I/O in queue (PIQ) 96 are used to buffer address 
transactions to and from address bus 58. SMP out 
queue 92 buffers transactions to be presented by sys- 
tem interface 24 upon address bus 58. Reissue trans- 
actions queued in response to the completion of coher- 
ency activity with respect to an ignored transaction are 
buffered in SMP out queue 92. Additionally, transactions 
generated in response to coherency activity received 
from network 14 are buffered in SMP out queue 92. SMP 
in queue 94 stores coherency related transactions to be 
serviced by system interface 24. Conversely, SMP PIQ 
96 stores I/O transactions to be conveyed to an I/O in- 
terface residing in another SMP node 12. I/O transac- 
tions generally are considered non-coherent and there- 
fore do not generate coherency activities. 
[0067] SMP in queue 94 and SMP PIQ 96 receive 
transactions to be queued from a transaction filter 98. 
Transaction filter 98 is coupled to MTAG 68 and SMP 
address bus 58. If transaction filter 98 detects an I/O
transaction upon address bus 58 which identifies an
I/O interface upon another SMP node 12, transaction filter
98 places the transaction into SMP PIQ 96. If a coherent
transaction to an LPA address is detected by transaction 
filter 98, then the corresponding coherency state from 
MTAG 68 is examined. In accordance with the coheren- 
cy state, transaction filter 98 may assert ignore signal 
70 and may queue a coherency transaction in SMP in
queue 94. Ignore signal 70 is asserted and a coherency
transaction is queued if MTAG 68 indicates that insufficient
access rights to the coherency unit for performing
the coherent transaction are maintained by SMP node
12A. Conversely, ignore signal 70 is deasserted and a 
coherency transaction is not generated if MTAG 68 in- 
dicates that a sufficient access right is maintained by 
SMP node 12A. 
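
The routing decisions made by transaction filter 98 in paragraph [0067] may be sketched as a single function. The function name, its parameters, and the returned tuple are hypothetical. The passage does not state how the ignore signal is handled for I/O transactions, so the sketch reports None in that case rather than guessing.

```python
def filter_transaction(kind: str, targets_remote_io: bool,
                       sufficient_rights: bool) -> tuple:
    """Return (queue_name, ignore_asserted) for one address bus transaction.

    kind is "io" or "coherent". queue_name is None when nothing is queued.
    """
    if kind == "io":
        # I/O transactions aimed at another SMP node go to the SMP PIQ.
        # Ignore handling for I/O is not described here, hence None.
        return ("SMP PIQ" if targets_remote_io else None, None)
    if sufficient_rights:
        # MTAG indicates sufficient access rights: nothing queued, no ignore.
        return (None, False)
    # Insufficient rights: queue a coherency transaction and assert ignore.
    return ("SMP in queue", True)
```

A coherent transaction for which the MTAG state grants insufficient access rights is therefore the only case in this sketch in which the ignore signal is asserted.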

[0068] Transactions from SMP in queue 94 and SMP 
PIQ 96 are processed by a request agent 100 within system
interface 24. Prior to action by request agent 100,
LPA2GA translation unit 82 translates the address of the
transaction (if it is an LPA address) from the local phys- 
ical address presented upon SMP address bus 58 into 
the corresponding global address. Request agent 100 
then generates a header packet specifying a particular 
coherency request to be transmitted to the home node 
identified by the global address. The coherency request 
is placed into output header queue 86. Subsequently, a 
coherency reply is received into input header queue 84. 
Request agent 100 processes the coherency replies 
from input header queue 84, potentially generating re- 
issue transactions for SMP out queue 92 (as described 
below). 

[0069] Also included in system interface 24 are a home
agent 102 and a slave agent 104. Home agent 102 processes
coherency requests received from input header
queue 84. From the coherency information stored in di- 
rectory 66 with respect to a particular global address, 
home agent 102 determines if a coherency demand is 
to be transmitted to one or more slave agents in other 
SMP nodes 12. In one embodiment, home agent 102 
blocks the coherency information corresponding to the 
affected coherency unit. In other words, subsequent re- 
quests involving the coherency unit are not performed 
until the coherency activity corresponding to the coher- 
ency request is completed. According to one embodi- 
ment, home agent 102 receives a coherency completion 
from the request agent which initiated the coherency re- 
quest (via input header queue 84). The coherency com- 
pletion indicates that the coherency activity has com- 
pleted. Upon receipt of the coherency completion, home 
agent 102 removes the block upon the coherency information
corresponding to the affected coherency unit. It
is noted that, since the coherency information is blocked
until completion of the coherency activity, home agent
102 may update the coherency information in accordance
with the coherency activity performed immediately
when the coherency request is received. 
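
The blocking behavior of home agent 102 can be illustrated with a small model. The class name, the method names, and the list used to hold deferred requests are assumptions; the text states only that subsequent requests involving a blocked coherency unit are not performed until the coherency completion is received.

```python
class HomeAgentModel:
    """Toy model of per-coherency-unit blocking (hypothetical names)."""

    def __init__(self) -> None:
        self.blocked = set()   # global addresses of blocked coherency units
        self.deferred = []     # requests waiting for a block to be removed

    def receive_request(self, ga: int, requester: int) -> bool:
        """Begin servicing a request; return False if it had to wait."""
        if ga in self.blocked:
            self.deferred.append((ga, requester))
            return False
        self.blocked.add(ga)   # block until the coherency completion arrives
        return True

    def receive_completion(self, ga: int) -> None:
        # The coherency completion removes the block for this unit.
        self.blocked.discard(ga)
```

Requests to different coherency units do not interfere in this model; only a second request to the same unit is held back until the first completes.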
[0070] Slave agent 104 receives coherency demands
from home agents of other SMP nodes 12 via input
header queue 84. In response to a particular coherency
demand, slave agent 104 may queue a coherency transaction
in SMP out queue 92. In one embodiment, the
coherency transaction may cause caches 18 and caches
internal to processors 16 to invalidate the affected
coherency unit. If the coherency unit is modified in the
caches, the modified data is transferred to system interface
24. Alternatively, the coherency transaction may
cause caches 18 and caches internal to processors 16
to change the coherency state of the coherency unit to
shared. Once slave agent 104 has completed activity in
response to a coherency demand, slave agent 104
transmits a coherency reply to the request agent which
initiated the coherency request corresponding to the
coherency demand. The coherency reply is queued in output
header queue 86. Prior to performing activities in
response to a coherency demand, the global address received
with the coherency demand is translated to a local
physical address via GA2LPA translation unit 80.
[0071] According to one embodiment, the coherency 
protocol enforced by request agents 100, home agents 
102, and slave agents 104 includes a write invalidate 
policy. In other words, when a processor 16 within an 
SMP node 12 updates a coherency unit, any copies of 
the coherency unit stored within other SMP nodes 12 
are invalidated. However, other write policies may be 
used in other embodiments. For example, a write update 
policy may be employed. According to a write update 
policy, when a coherency unit is updated the updated
data is transmitted to each of the copies of the coheren- 
cy unit stored in each of the SMP nodes 12. 
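
The two write policies contrasted in paragraph [0071] differ only in what happens to the other copies of a coherency unit. The sketch below is an illustration under assumptions: the function name, the policy strings, and the dictionary mapping node identifiers to cached values are all hypothetical.

```python
def apply_write(copies: dict, writer: int, value: int, policy: str) -> dict:
    """Return the per-node copies of one coherency unit after a write.

    copies maps node id to cached value; absent nodes hold no valid copy.
    """
    if policy == "invalidate":
        # Write invalidate: all copies in other nodes are invalidated.
        return {writer: value}
    if policy == "update":
        # Write update: the new data is sent to every existing copy.
        updated = {node: value for node in copies}
        updated[writer] = value
        return updated
    raise ValueError(f"unknown policy: {policy}")
```

After a write under the invalidate policy only the writer holds a valid copy, whereas under the update policy every node that held a copy still holds one, now carrying the new value.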
[0072] Turning next to Fig. 4, a diagram depicting typ- 
ical coherency activity performed between the request 
agent 100 of a first SMP node 12A-12D (the "requesting
node"), the home agent 102 of a second SMP node 12A-12D
(the "home node"), and the slave agent 104 of a
third SMP node 12A-12D (the "slave node") in response 
to a particular transaction upon the SMP bus 20 within 
the SMP node 12 corresponding to request agent 100 
is shown. Specific coherency activities employed ac- 
cording to one embodiment of computer system 10 as 
shown in Fig. 1 are further described below with respect 
to Figs. 9-13. Reference numbers 100, 102, and 104 are
used to identify request agents, home agents, and slave 
agents throughout the remainder of this description. It 
is understood that, when an agent communicates with 
another agent, the two agents often reside in different 
SMP nodes 12A-12D. 

[0073] Upon receipt of a transaction from SMP bus 
20, request agent 100 forms a coherency request ap- 
propriate for the transaction and transmits the coheren- 
cy request to the home node corresponding to the ad- 
dress of the transaction (reference number 110). The 
coherency request indicates the access right requested 
by request agent 100, as well as the global address of
the affected coherency unit. The access right requested 
is sufficient for allowing occurrence of the transaction 
being attempted in the SMP node 12 corresponding to 
request agent 100. 

[0074] Upon receipt of the coherency request, home 
agent 102 accesses the associated directory 66 and determines
which SMP nodes 12 are storing copies of the
affected coherency unit. Additionally, home agent 102 
determines the owner of the coherency unit. Home 
agent 102 may generate a coherency demand to the
slave agents 104 of each of the nodes storing copies of
the affected coherency unit, as well as to the slave agent 
104 of the node which has the owned coherency state 
for the affected coherency unit (reference number 112). 
The coherency demands indicate the new coherency 
state for the affected coherency unit in the receiving 
SMP nodes 12. While the coherency request is out- 
standing, home agent 102 blocks the coherency infor- 
mation corresponding to the affected coherency unit 
such that subsequent coherency requests involving the 
affected coherency unit are not initiated by the home 
agent 102. Home agent 102 additionally updates the
coherency information to reflect completion of the
coherency request.

[0075] Home agent 102 may additionally transmit a 
coherency reply to request agent 100 (reference 
number 114). The coherency reply may indicate the 
number of coherency replies which are forthcoming 
from slave agents 104. Alternatively, certain transac- 
tions may be completed without interaction with slave 
agents 104. For example, an I/O transaction targeting 
an I/O interface 26 in the SMP node 12 containing home 
agent 102 may be completed by home agent 102. Home
agent 102 may queue a transaction for the associated
SMP bus 20 (reference number 116), and then transmit
a reply indicating that the transaction is complete. 
[0076] A slave agent 104, in response to a coherency
demand from home agent 102, may queue a transaction
for presentation upon the associated SMP bus 20
(reference number 118). Additionally, slave agents 104
transmit a coherency reply to request agent 100
(reference number 120). The coherency reply indicates that
the coherency demand received in response to a particular
coherency request has been completed by that
slave. The coherency reply is transmitted by slave
agents 104 when the coherency demand has been completed,
or at such time prior to completion of the coherency
demand at which the coherency demand is guaranteed
to be completed upon the corresponding SMP
node 12 and at which no state changes to the affected
coherency unit will be performed prior to completion of
the coherency demand.

[0077] When request agent 100 has received a coherency
reply from each of the affected slave agents 104,
request agent 100 transmits a coherency completion to
home agent 102 (reference number 122). Upon receipt 
of the coherency completion, home agent 102 removes 
the block from the corresponding coherency informa- 
tion. Request agent 100 may queue a reissue transac- 
tion for performance upon SMP bus 20 to complete the 
transaction within the SMP node 12 (reference number 
124). 
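
By way of non-limiting illustration, the blocking of coherency information described in paragraphs [0074] and [0077] may be sketched as follows. This is a minimal model, not the claimed implementation: the class names, the method names, and the use of a per-entry queue to hold requests that arrive while an entry is blocked are all assumptions made for the example. The description states only that such requests are not initiated until the block is removed.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """Hypothetical directory entry for one coherency unit."""
    blocked: bool = False
    pending: list = field(default_factory=list)  # assumed holding queue


class HomeAgent:
    """Hypothetical model of the block/unblock behaviour of a home agent."""

    def __init__(self):
        self.directory = {}  # global address -> DirectoryEntry

    def receive_request(self, tag, address):
        """Return True if the request is initiated, False if it must wait."""
        entry = self.directory.setdefault(address, DirectoryEntry())
        if entry.blocked:
            entry.pending.append(tag)  # not initiated while the entry is blocked
            return False
        entry.blocked = True  # block while the request is outstanding
        return True

    def receive_completion(self, address):
        """Remove the block; return the tag of the next request started, if any."""
        entry = self.directory[address]
        entry.blocked = False
        if entry.pending:
            entry.blocked = True
            return entry.pending.pop(0)
        return None
```

In this sketch, requests to different coherency units proceed independently, while a second request to a blocked coherency unit is deferred until the coherency completion for the first request is received.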

[0078] It is noted that each coherency request is as- 
signed a unique tag by the request agent 100 which is- 
sues the coherency request. Subsequent coherency de- 
mands, coherency replies, and coherency completions 
include the tag. In this manner, coherency activity
regarding a particular coherency request may be identified
by each of the involved agents. It is further noted that
non-coherent operations may be performed in response
to non-coherent transactions (e.g. I/O transactions).
Non-coherent operations may involve only the requesting
node and the home node. Still further, a different
unique tag may be assigned to each coherency request
by the home agent 102. The different tag identifies the
home agent 102, and is used for the coherency completion
in lieu of the requestor tag.
[0079] Turning now to Fig. 5, a diagram depicting
coherency activity for an exemplary embodiment of
computer system 10 in response to a read to own transaction
upon SMP bus 20 is shown. A read to own transaction
is performed when a cache miss is detected for a
particular datum requested by a processor 16 and the
processor 16 requests write permission to the coherency
unit. A store cache miss may generate a read to own
transaction, for example.

[0080] A request agent 100, home agent 102, and
several slave agents 104 are shown in Fig. 5. The node
receiving the read to own transaction from SMP bus 20
stores the affected coherency unit in the invalid state
(e.g. the coherency unit is not stored in the node). The
subscript "i" in request agent 100 indicates the invalid state.
The home node stores the coherency unit in the shared
state, and nodes corresponding to several slave agents
104 store the coherency unit in the shared state as well.
The subscript "s" in home agent 102 and slave agents
104 is indicative of the shared state at those nodes. The
read to own operation causes transfer of the requested
coherency unit to the requesting node. The requesting
node receives the coherency unit in the modified state.
[0081] Upon receipt of the read to own transaction
from SMP bus 20, request agent 100 transmits a read
to own coherency request to the home node of the
coherency unit (reference number 130). The home agent
102 in the receiving home node detects the shared state
for one or more other nodes. Since the slave agents are
each in the shared state, not the owned state, the home
node may supply the requested data directly. Home
agent 102 transmits a data coherency reply to request
agent 100, including the data corresponding to the
requested coherency unit (reference number 132).
Additionally, the data coherency reply indicates the number
of acknowledgments which are to be received from
slave agents of other nodes prior to request agent 100
taking ownership of the data. Home agent 102 updates
directory 66 to indicate that the requesting SMP node
12A-12D is the owner of the coherency unit, and that
each of the other SMP nodes 12A-12D is invalid. When
the coherency information regarding the coherency unit
is unblocked upon receipt of a coherency completion
from request agent 100, directory 66 matches the state
of the coherency unit at each SMP node 12.
[0082] Home agent 102 transmits invalidate coherency
demands to each of the slave agents 104 which are
maintaining shared copies of the affected coherency
unit (reference numbers 134A, 134B, and 134C). The
invalidate coherency demand causes the receiving
slave agent to invalidate the corresponding coherency
unit within the node, and to send an acknowledge
coherency reply to the requesting node indicating
completion of the invalidation. Each slave agent 104 completes
invalidation of the coherency unit and subsequently
transmits an acknowledge coherency reply (reference
numbers 136A, 136B, and 136C). In one embodiment,
each of the acknowledge replies includes a count of the
total number of replies to be received by request agent
100 with respect to the coherency unit.
[0083] Subsequent to receiving each of the acknowledge
coherency replies from slave agents 104 and the
data coherency reply from home agent 102, request
agent 100 transmits a coherency completion to home
agent 102 (reference number 138). Request agent 100
validates the coherency unit within its local memory, and
home agent 102 releases the block upon the corresponding
coherency information. It is noted that data
coherency reply 132 and acknowledge coherency replies
136 may be received in any order depending upon the
number of outstanding transactions within each node,
among other things.
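
By way of non-limiting illustration, the order-independent collection of replies for the read to own example of Fig. 5 may be sketched as follows. The class and method names are hypothetical. The sketch assumes that the number of acknowledgments is learned from the data coherency reply of the home agent, as in paragraph [0081]; the alternative of paragraph [0082], in which each acknowledge reply carries the count, is not modelled.

```python
class ReplyTracker:
    """Hypothetical per-request bookkeeping within a request agent."""

    def __init__(self):
        self.expected_acks = None  # unknown until the data reply arrives
        self.acks_seen = 0
        self.data = None

    def on_data_reply(self, data, ack_count):
        """Data coherency reply from the home agent (reference number 132)."""
        self.data = data
        self.expected_acks = ack_count

    def on_acknowledge(self):
        """Acknowledge coherency reply from a slave agent (136A-136C)."""
        self.acks_seen += 1

    def ready_for_completion(self):
        """True once the data reply and every expected acknowledge have arrived."""
        return (self.expected_acks is not None
                and self.acks_seen == self.expected_acks)
```

Because the tracker only compares counts, it gives the same result whether the acknowledge replies arrive before or after the data coherency reply.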

[0084] Turning now to Fig. 6, a flowchart 140 depicting 
an exemplary state machine for use by request agents
100 is shown. Request agents 100 may include multiple
independent copies of the state machine represented 
by flowchart 140, such that multiple requests may be 
concurrently processed. 

[0085] Upon receipt of a transaction from SMP in
queue 94, request agent 100 enters a request ready
state 142. In request ready state 142, request agent 100
transmits a coherency request to the home agent 102
residing in the home node identified by the global
address of the affected coherency unit. Upon transmission
of the coherency request, request agent 100 transitions
to a request active state 144. During request active state
144, request agent 100 receives coherency replies from
slave agents 104 (and optionally from home agent 102).
When each of the coherency replies has been received,
request agent 100 transitions to a new state depending
upon the type of transaction which initiated the
coherency activity. Additionally, request active state 144 may
employ a timer for detecting that coherency replies have
not been received within a predefined time-out period. If
the timer expires prior to the receipt of the number of
replies specified by home agent 102, then request agent
100 transitions to an error state (not shown). Still further,
certain embodiments may employ a reply indicating that
a read transfer failed. If such a reply is received, request
agent 100 transitions to request ready state 142 to
reattempt the read.

[0086] If replies are received without error or time-out, 
then the state transitioned to by request agent 100 for 
read transactions is read complete state 146. It is noted
that, for read transactions, one of the received replies
may include the data corresponding to the requested
coherency unit. Request agent 100 reissues the read
transaction upon SMP bus 20 and further transmits the
coherency completion to home agent 102. Subsequently,
request agent 100 transitions to an idle state 148. A
new transaction may then be serviced by request agent 
100 using the state machine depicted in Fig. 6. 
[0087] Conversely, write active state 150 and ignored
write reissue state 152 are used for write transactions. 
Ignore signal 70 is not asserted for certain write trans- 
actions in computer system 10, even when coherency 
activity is initiated upon network 14. For example, I/O 
write transactions are not ignored. The write data is 
transferred to system interface 24, and is stored therein. 
Write active state 150 is employed for non-ignored write 
transactions, to allow for transfer of data to system in- 
terface 24 if the coherency replies are received prior to 
the data phase of the write transaction upon SMP bus 
20. Once the corresponding data has been received,
request agent 100 transitions to write complete state 154.
During write complete state 154, the coherency completion
reply is transmitted to home agent 102. Subsequently,
request agent 100 transitions to idle state 148.
[0088] Ignored write transactions are handled via a 
transition to ignored write reissue state 152. During ig- 
nored write reissue state 152, request agent 100 reis- 
sues the ignored write transaction upon SMP bus 20. In 
this manner, the write data may be transferred from the 
originating processor 16 and the corresponding write 
transaction released by processor 16. Depending upon 
whether or not the write data is to be transmitted with 
the coherency completion, request agent 100 transi- 
tions to either the ignored write active state 156 or the 
ignored write complete state 158. Ignored write active 
state 156, similar to write active state 150, is used to 
await data transfer from SMP bus 20. During ignored 
write complete state 158, the coherency completion is 
transmitted to home agent 102. Subsequently, request 
agent 100 transitions to idle state 148. From idle state 
148, request agent 100 transitions to request ready 
state 142 upon receipt of a transaction from SMP in 
queue 94. 
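
By way of non-limiting illustration, the request agent state machine of Fig. 6 may be written as a transition table. The state values follow the reference numbers in the description; the event names are hypothetical labels introduced for the example. The transition from ignored write active state 156 to ignored write complete state 158 is implied rather than stated in the description and is therefore an assumption, as is the placeholder value used for the error state, which is not shown in Fig. 6.

```python
from enum import Enum


class State(Enum):
    REQUEST_READY = 142
    REQUEST_ACTIVE = 144
    READ_COMPLETE = 146
    IDLE = 148
    WRITE_ACTIVE = 150
    IGNORED_WRITE_REISSUE = 152
    WRITE_COMPLETE = 154
    IGNORED_WRITE_ACTIVE = 156
    IGNORED_WRITE_COMPLETE = 158
    ERROR = 0  # placeholder: the error state has no reference number


# (current state, hypothetical event name) -> next state
TRANSITIONS = {
    (State.IDLE, "transaction"): State.REQUEST_READY,
    (State.REQUEST_READY, "request_sent"): State.REQUEST_ACTIVE,
    (State.REQUEST_ACTIVE, "read_replies"): State.READ_COMPLETE,
    (State.REQUEST_ACTIVE, "write_replies"): State.WRITE_ACTIVE,
    (State.REQUEST_ACTIVE, "ignored_write_replies"): State.IGNORED_WRITE_REISSUE,
    (State.REQUEST_ACTIVE, "timeout"): State.ERROR,
    (State.REQUEST_ACTIVE, "read_failed"): State.REQUEST_READY,
    (State.READ_COMPLETE, "completion_sent"): State.IDLE,
    (State.WRITE_ACTIVE, "data_received"): State.WRITE_COMPLETE,
    (State.WRITE_COMPLETE, "completion_sent"): State.IDLE,
    (State.IGNORED_WRITE_REISSUE, "reissued_with_data"): State.IGNORED_WRITE_ACTIVE,
    (State.IGNORED_WRITE_REISSUE, "reissued_no_data"): State.IGNORED_WRITE_COMPLETE,
    (State.IGNORED_WRITE_ACTIVE, "data_received"): State.IGNORED_WRITE_COMPLETE,  # assumed
    (State.IGNORED_WRITE_COMPLETE, "completion_sent"): State.IDLE,
}


def step(state, event):
    """Return the next state, or raise ValueError for an undefined transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None


def run(events, state=State.IDLE):
    """Apply a sequence of events starting from the given state."""
    for event in events:
        state = step(state, event)
    return state
```

A request agent holding multiple independent copies of the state machine, as in paragraph [0084], would correspond to several independent `state` variables sharing the same table.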

[0089] Turning next to Fig. 7, a flowchart 160 depict- 
ing an exemplary state machine for home agent 102 is 
shown. Home agents 102 may include multiple inde- 
pendent copies of the state machine represented by 
flowchart 160 in order to allow for processing of multiple
outstanding requests to the home agent 102. However, 
the multiple outstanding requests do not affect the same 
coherency unit, according to one embodiment. 
[0090] Home agent 102 receives coherency requests
in a receive request state 162. The request may be
classified as either a coherent request or an other
transaction request. Other transaction requests may include
I/O read and I/O write requests, interrupt requests, and
administrative requests, according to one embodiment.
The non-coherent requests are handled by transmitting
a transaction upon SMP bus 20, during a state 164. A
coherency completion is subsequently transmitted. Upon
receiving the coherency completion, I/O write and
accepted interrupt transactions result in transmission of a
data transaction upon SMP bus 20 in the home node
(i.e. data only state 165). When the data has been
transferred, home agent 102 transitions to idle state 166.
Alternatively, I/O read, administrative, and rejected
interrupt transactions cause a transition to idle state 166
upon receipt of the coherency completion.
[0091] Conversely, home agent 102 transitions to a 
check state 168 upon receipt of a coherent request. 
Check state 168 is used to detect if coherency activity 
is in progress for the coherency unit affected by the co- 
herency request. If the coherency activity is in progress 
(i.e. the coherency information is blocked), then home 
agent 102 remains in check state 168 until the in- 
progress coherency activity completes. Home agent 
102 subsequently transitions to a set state 170. 
[0092] During set state 170, home agent 102 sets the
status of the directory entry storing the coherency
information corresponding to the affected coherency unit to
blocked. The blocked status prevents subsequent activity
to the affected coherency unit from proceeding,
simplifying the coherency protocol of computer system 10.
Depending upon the read or write nature of the transac- 
tion corresponding to the received coherency request, 
home agent 102 transitions to read state 172 or write 
reply state 174. 

[0093] While in read state 172, home agent 102 issues
coherency demands to slave agents 104 which are
to be updated with respect to the read transaction.
Home agent 102 remains in read state 172 until a
coherency completion is received from request agent 100,
after which home agent 102 transitions to clear block
status state 176. In embodiments in which a coherency
request for a read may fail, home agent 102 restores the 
state of the affected directory entry to the state prior to 
the coherency request upon receipt of a coherency com- 
pletion indicating failure of the read transaction. 
[0094] During write reply state 174, home agent 102
transmits a coherency reply to request agent 100. Home
agent 102 remains in write reply state 174 until a
coherency completion is received from request agent 100. If
data is received with the coherency completion, home
agent 102 transitions to write data state 178. Alternatively,
home agent 102 transitions to clear block status
state 176 upon receipt of a coherency completion not
containing data.

[0095] Home agent 102 issues a write transaction upon
SMP bus 20 during write data state 178 in order to
transfer the received write data. For example, a write
stream operation (described below) results in a
transfer of data to home agent 102. Home agent 102
transmits the received data to memory 22 for storage.
Subsequently, home agent 102 transitions to clear
block status state 176.

[0096] Home agent 102 clears the blocked status of 
the coherency information corresponding to the coher- 
ency unit affected by the received coherency request in 
clear block status state 176. The coherency information
may be subsequently accessed. The state found within
the unblocked coherency information reflects the
coherency activity initiated by the previously received
coherency request. After clearing the block status of the
corresponding coherency information, home agent 102
transitions to idle state 166. From idle state 166, home
agent 102 transitions to receive request state 162 upon
receipt of a coherency request.
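
By way of non-limiting illustration, the sequence of states visited by home agent 102 for a coherent request, as described with respect to Fig. 7, may be enumerated as follows. The function name and its parameters are hypothetical. The sketch covers only coherent requests and does not model the wait in check state 168 while earlier coherency activity is in progress.

```python
def coherent_request_path(is_read, completion_has_data=False):
    """List the home agent states visited for one coherent request.

    is_read: True for a read-type request, False for a write-type request.
    completion_has_data: for writes, whether the coherency completion
    carries data (e.g. a write stream operation).
    """
    path = ["receive request 162", "check 168", "set 170"]
    if is_read:
        path.append("read 172")
    else:
        path.append("write reply 174")
        if completion_has_data:
            path.append("write data 178")
    path += ["clear block status 176", "idle 166"]
    return path
```

For example, `coherent_request_path(False, completion_has_data=True)` passes through write data state 178 before the block is cleared, whereas a read never does.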
[0097] Turning now to Fig. 8, a flowchart 180 is shown
depicting an exemplary state machine for slave agents
104. Slave agent 104 receives coherency demands during
a receive state 182. In response to a coherency
demand, slave agent 104 may queue a transaction for
presentation upon SMP bus 20. The transaction causes
a state change in caches 18 and caches internal to
processors 16 in accordance with the received coherency
demand. Slave agent 104 queues the transaction during
send request state 184.

[0098] During send reply state 186, slave agent 104
transmits a coherency reply to the request agent 100
which initiated the transaction. It is noted that, according
to various embodiments, slave agent 104 may transition
from send request state 184 to send reply state 186 upon
queuing the transaction for SMP bus 20 or upon
successful completion of the transaction upon SMP bus 20.
Subsequent to coherency reply transmittal, slave agent
104 transitions to an idle state 188. From idle state 188,
slave agent 104 may transition to receive state 182 upon
receipt of a coherency demand.

[0099] Turning now to Figs. 9-12, several tables are
shown listing exemplary coherency request types,
coherency demand types, coherency reply types, and
coherency completion types. The types shown in the
tables of Figs. 9-12 may be employed by one embodiment
of computer system 10. Other embodiments may employ
other sets of types.

[0100] Fig. 9 is a table 190 listing the types of
coherency requests. A first column 192 lists a code for each
request type, which is used in Fig. 13 below. A second
column 194 lists the coherency request types, and a
third column 196 indicates the originator of the
coherency request. Similar columns are used in Figs. 10-12
for coherency demands, coherency replies, and
coherency completions. An "R" indicates request agent 100;
an "S" indicates slave agent 104; and an "H" indicates
home agent 102.

[0101] A read to share request is performed when a
coherency unit is not present in a particular SMP node
and the nature of the transaction from SMP bus 20 to
the coherency unit indicates that read access to the
coherency unit is desired. For example, a cacheable read
transaction may result in a read to share request.
Generally speaking, a read to share request is a request for
a copy of the coherency unit in the shared state.
Similarly, a read to own request is a request for a copy of the
coherency unit in the owned state. Copies of the
coherency unit in other SMP nodes should be changed to the
invalid state. A read to own request may be performed
in response to a cache miss of a cacheable write
transaction, for example.

[0102] Read stream and write stream are requests to
read or write an entire coherency unit. These operations
are typically used for block copy operations. Processors
16 and caches 18 do not cache data provided in
response to a read stream or write stream request.
Instead, the coherency unit is provided as data to the
processor 16 in the case of a read stream request, or the
data is written to the memory 22 in the case of a write 
stream request. It is noted that read to share, read to 
own, and read stream requests may be performed as 
COMA operations (e.g. RTS, RTO, and RS) or as NUMA 
operations (e.g. RTSN, RTON, and RSN). 
[0103] A write back request is performed when a co- 
herency unit is to be written to the home node of the 
coherency unit. The home node replies with permission 
to write the coherency unit back. The coherency unit is 
then passed to the home node with the coherency com- 
pletion. 

[0104] The invalidate request is performed to cause 
copies of a coherency unit in other SMP nodes to be 
invalidated. An exemplary case in which the invalidate 
request is generated is a write stream transaction to a 
shared or owned coherency unit. The write stream 
transaction updates the coherency unit, and therefore 
copies of the coherency unit in other SMP nodes are 
invalidated. 

[0105] I/O read and write requests are transmitted in 
response to I/O read and write transactions. I/O trans- 
actions are non-coherent (i.e. the transactions are not 
cached and coherency is not maintained for the trans- 
actions). I/O block transactions transfer a larger portion 
of data than normal I/O transactions. In one embodi- 
ment, sixty-four bytes of information are transferred in 
a block I/O operation while eight bytes are transferred 
in a non-block I/O transaction. 

[0106] Flush requests cause copies of the coherency
unit to be invalidated. Modified copies are returned to 
the home node. Interrupt requests are used to signal in- 
terrupts to a particular device in a remote SMP node. 
The interrupt may be presented to a particular processor 
16, which may execute an interrupt service routine 
stored at a predefined address in response to the inter- 
rupt. Administrative packets are used to send certain 
types of reset signals between the nodes. 
[0107] Fig. 10 is a table 198 listing exemplary coherency
demand types. Similar to table 190, columns 192,
194, and 196 are included in table 198. A read to share 
demand is conveyed to the owner of a coherency unit, 
causing the owner to transmit data to the requesting 
node. Similarly, read to own and read stream demands 
cause the owner of the coherency unit to transmit data 
to the requesting node. Additionally, a read to own de- 
mand causes the owner to change the state of the co- 
herency unit in the owner node to invalid. Read stream 
and read to share demands cause a state change to
owned (from modified) in the owner node. 



[0108] Invalidate demands do not cause the transfer 
of the corresponding coherency unit. Instead, an inval- 
idate demand causes copies of the coherency unit to be 
invalidated. Finally, administrative demands are conveyed
in response to administrative requests. It is noted
that each of the demands is initiated by home agent
102, in response to a request from request agent 100.
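
By way of non-limiting illustration, the state changes caused by the coherency demands of Fig. 10 in the receiving node may be sketched as follows. The function name, the demand labels, and the single-letter state codes ("M", "O", "S", "I" for the MOSI states) are hypothetical. The description states only that read to share and read stream demands change the owner from modified to owned; leaving any other state unchanged for those demands is an assumption of the sketch.

```python
def state_after_demand(demand, state):
    """Return the coherency state of the receiving node after a demand.

    demand: one of "read_to_share", "read_stream", "read_to_own",
            "invalidate" (hypothetical labels).
    state:  current state of the coherency unit in the receiving node.
    """
    if demand in ("read_to_own", "invalidate"):
        return "I"  # the receiving node's copy is invalidated
    if demand in ("read_to_share", "read_stream"):
        # modified -> owned; other states left unchanged (assumption)
        return "O" if state == "M" else state
    raise ValueError(f"unknown demand type: {demand!r}")
```

Administrative demands are omitted because the description does not associate them with a coherency state change.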
[0109] Fig. 11 is a table 200 listing exemplary reply 
types employed by one embodiment of computer sys- 
tem 10. Similar to Figs. 9 and 10, Fig. 11 includes col- 
umns 192, 194, and 196 for the coherency replies. 
[0110] A data reply is a reply including the requested
data. The owner slave agent typically provides the data 
reply for coherency requests. However, home agents 
may provide data for I/O read requests. 
[0111] The acknowledge reply indicates that a coher- 
ency demand associated with a particular coherency re- 
quest is completed. Slave agents typically provide ac- 
knowledge replies, but home agents provide acknowl- 
edge replies (along with data) when the home node is 
the owner of the coherency unit. 
[0112] Slave not owned, address not mapped and error
replies are conveyed by slave agent 104 when an
error is detected. The slave not owned reply is sent if a
slave is identified by home agent 102 as the owner of a
coherency unit and the slave no longer owns the coherency
unit. The address not mapped reply is sent if the
slave receives a demand for which no device upon the 
corresponding SMP bus 20 claims ownership. Other er- 
ror conditions detected by the slave agent are indicated 
via the error reply. 

[0113] In addition to the error replies available to slave
agent 104, home agent 102 may provide error replies.
The negative acknowledge (NACK) and negative response
(NOPE) are used by home agent 102 to indicate
that the corresponding request does not require service
by home agent 102. The NACK transaction may be
used to indicate that the corresponding request is
rejected by the home node. For example, an interrupt request
receives a NACK if the interrupt is rejected by the re- 
ceiving node. An acknowledge (ACK) is conveyed if the 
interrupt is accepted by the receiving node. The NOPE 
transaction is used to indicate that a corresponding flush 
request was conveyed for a coherency unit which is not 
stored by the requesting node. 
[0114] Fig. 12 is a table 202 depicting exemplary
coherency completion types according to one embodiment
of computer system 10. Similar to Figs. 9-11, Fig. 12 in- 
cludes columns 192, 194, and 196 for coherency com- 
pletions. 

[0115] A completion without data is used as a signal 
from request agent 100 to home agent 102 that a par- 
ticular request is complete. In response, home agent 
102 unblocks the corresponding coherency information.
Two types of data completions are included, corre- 
sponding to dissimilar transactions upon SMP bus 20. 
One type of reissue transaction involves only a data 
phase upon SMP bus 20. This reissue transaction may
be used for I/O write and interrupt transactions, in one
embodiment. The other type of reissue transaction in- 
volves both an address and data phase. Coherent 
writes, such as write stream and write back, may employ 
the reissue transaction including both address and data 
phases. Finally, a completion indicating failure is includ- 
ed for read requests which fail to acquire the requested 
state. 

[0116] Turning next to Fig. 13, a table 210 is shown
depicting coherency activity in response to various
transactions upon SMP bus 20. Table 210 depicts transactions
which result in requests being transmitted to other
SMP nodes 12. Transactions which complete within
an SMP node are not shown. A "-" in a column indicates
that no activity is performed with respect to that column
in the case considered within a particular row. A
transaction column 212 is included indicating the transaction
received upon SMP bus 20 by request agent 100. MTAG
column 214 indicates the state of the MTAG for the
coherency unit accessed by the address corresponding to
the transaction. The states shown include the MOSI
states described above, and an "n" state. The "n" state
indicates that the coherency unit is accessed in NUMA
mode for the SMP node in which the transaction is
initiated. Therefore, no local copy of the coherency unit is
stored in the requesting node's memory. Instead, the
coherency unit is transferred from the home SMP node (or
an owner node) and is transmitted to the requesting
processor 16 or cache 18 without storage in memory 22.
[0117] A request column 216 lists the coherency re- 
quest transmitted to the home agent identified by the 
address of the transaction. Upon receipt of the coher- 
ency request listed in column 216, home agent 102 
checks the state of the coherency unit for the requesting 
node as recorded in directory 66. D column 218 lists the
current state of the coherency unit recorded for the re- 
questing node, and D' column 220 lists the state of the 
coherency unit recorded for the requesting node as up- 
dated by home agent 102 in response to the received 
coherency request. Additionally, home agent 102 may 
generate a first coherency demand to the owner of the 
coherency unit and additional coherency demands to 
any nodes maintaining shared copies of the coherency 
unit. The coherency demand transmitted to the owner 
is shown in column 222, while the coherency demand 
transmitted to the sharing nodes is shown in column 
224. Still further, home agent 102 may transmit a coherency
reply to the requesting node. Home agent replies
are shown in column 226. 

[0118] The slave agent 104 in the SMP node indicated
as the owner of the coherency unit transmits a coheren- 
cy reply as shown in column 228. Slave agents 104 in 
nodes indicated as sharing nodes respond to the coher- 
ency demands shown in column 224 with the coherency 
replies shown in column 230, subsequent to performing 
state changes indicated by the received coherency de- 
mand. 

[0119] Upon receipt of the appropriate number of
coherency replies, request agent 100 transmits a coherency
completion to home agent 102. The coherency
completions used for various transactions are shown in 
column 232. 

[0120] As an example, a row 234 depicts the coherency
activity in response to a read to share transaction
upon SMP bus 20 for which the corresponding MTAG
state is invalid. The corresponding request agent 100
transmits a read to share coherency request to the home
node identified by the global address associated with
the read to share transaction. For the case shown in row
234, the directory of the home node indicates that the
requesting node is storing the data in the invalid state.
The state in the directory of the home node for the
requesting node is updated to shared, and a read to share
coherency demand is transmitted by home agent 102 to
the node indicated by the directory to be the owner. No
demands are transmitted to sharers, since the transaction
seeks to acquire the shared state. The slave agent
104 in the owner node transmits the data corresponding
to the coherency unit to the requesting node. Upon
receipt of the data, the request agent 100 within the
requesting node transmits a coherency completion to the
home agent 102 within the home node. The transaction
is therefore complete.

[0121] It is noted that the state shown in D column 218
may not match the state in MTAG column 214. For
example, a row 236 shows a coherency unit in the invalid
state in MTAG column 214. However, the corresponding
state in D column 218 may be modified, owned, or
shared. Such situations occur when a prior coherency
request from the requesting node for the coherency unit
is outstanding within computer system 10 when the
access to MTAG 68 for the current transaction to the
coherency unit is performed upon address bus 58.
However, due to the blocking of directory entries during a
particular access, the outstanding request is completed
prior to access of directory 66 by the current request.
For this reason, the generated coherency demands are
dependent upon the directory state (which matches the
MTAG state at the time the directory is accessed). For
the example shown in row 236, since the directory
indicates that the coherency unit now resides in the requesting
node, the read to share request may be completed
by simply reissuing the read transaction upon SMP bus
20 in the requesting node. Therefore, the home node
acknowledges the request, including a reply count of
one, and the requesting node may subsequently reissue
the read transaction. It is further noted that, although
table 210 lists many types of transactions, additional
transactions may be employed according to various
embodiments of computer system 10.

Fast Write Stream Operations 


[0122] Turning now to Fig. 14, a diagram depicting a
local physical address space 300 in accordance with
one embodiment of computer system 10 is shown.
Generally speaking, an address space identifies a storage
location corresponding to each of the possible addresses
within the address space. The address space may
assign additional properties to certain addresses within 
the address space. In one embodiment, addresses with- 
in local physical address space 300 include 41 bits. 
[0123] As shown in Fig. 14, local physical address 
space 300 includes an LPA region 302 and an LPA^ 
region 304. LPA region 302 allows read and write trans- 
actions to occur to the corresponding storage locations 
once a coherency state is acquired consistent with the 
transaction. In other words, no additional properties are 
assigned to addresses within LPA region 302. In one 
embodiment, LPA region 302 is the set of addresses 
within address space 300 having most significant bits 
(MSBs) equal to 0xx00 (represented in binary). The "xx"
portion of the MSBs identifies the SMP node 12 which 
serves as the home node for the address. For example, 
xx=00 may identify SMP node 12A; xx=01 may identify 
SMP node 12B, etc. The address is a local physical ad- 
dress within LPA region 302 if the "xx" portion identifies 
the SMP node 12 containing the processor 16 which 
performs the transaction corresponding to the address. 
Otherwise, the address is a global address. Additionally, 
the global address is a local physical address within an- 
other SMP node 12. 

[0124] Addresses within LPAfw region 304 refer to the
same set of storage locations to which addresses within
LPA region 302 refer. For example, an address "A" within
LPA region 302 may refer to a storage location 306
storing a datum "B". The address "A" within LPAfw region
304 also refers to storage location 306 storing datum
"B". For this example, address "A" refers to the bits of
the address exclusive of the bits identifying LPAfw region
304 and LPA region 302 (e.g. the least significant 36
bits, in one embodiment). In one embodiment, LPAfw
region 304 is the set of addresses having MSBs equal to
0xx10 (represented in binary). The "xx" field is interpreted
as described above. It is noted that having two or
more regions of addresses within an address space
identifying the same set of storage locations is referred
to as aliasing.
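
The address layout of the embodiment described above can be illustrated with a short sketch. It is not part of the specification. It assumes that the five MSBs (pattern 0xxRR) occupy bits 40 to 36 of the 41-bit address, which follows from the stated 41-bit width and 36-bit offset; the names decode, DecodedAddress and fast_write_alias are hypothetical.

```python
from dataclasses import dataclass

ADDRESS_BITS = 41        # address width stated for one embodiment
OFFSET_BITS = 36         # bits remaining below the five MSBs (assumed layout)
REGION_LPA = 0b00        # low two MSBs "00": LPA region 302
REGION_LPA_FW = 0b10     # low two MSBs "10": LPAfw region 304


@dataclass(frozen=True)
class DecodedAddress:
    region: str      # "LPA", "LPAfw", or "other"
    home_node: int   # value of the "xx" field
    offset: int      # location shared by both aliases


def decode(address: int) -> DecodedAddress:
    """Split an address into region, home node ("xx") and offset."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address does not fit in 41 bits")
    msbs = address >> OFFSET_BITS            # five MSBs, pattern 0xxRR
    top_bit = (msbs >> 4) & 0b1
    home_node = (msbs >> 2) & 0b11
    region_bits = msbs & 0b11
    offset = address & ((1 << OFFSET_BITS) - 1)
    if top_bit == 0 and region_bits == REGION_LPA:
        region = "LPA"
    elif top_bit == 0 and region_bits == REGION_LPA_FW:
        region = "LPAfw"
    else:
        region = "other"                     # other MSB combinations
    return DecodedAddress(region, home_node, offset)


def fast_write_alias(address: int) -> int:
    """Return the LPAfw alias of an LPA address (same node, same offset)."""
    d = decode(address)
    if d.region != "LPA":
        raise ValueError("expected an address in the LPA region")
    msbs = (d.home_node << 2) | REGION_LPA_FW
    return (msbs << OFFSET_BITS) | d.offset
```

Under these assumptions, an LPA address and its LPAfw alias decode to the same home node and offset and differ only in the region bits, which is the aliasing property described above.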

[0125] In contrast to the transactions permitted to LPA
region 302, read transactions are not permitted to LPAfw
region 304. Write transactions are permitted to LPAfw
region 304. In one particular embodiment, write stream
transactions are permitted to LPAfw region 304 while
other write transactions are not permitted.
[0126] System interface 24 recognizes the write operation
to LPAfw region 304 as a "fast write" write operation.
Instead of first acquiring a coherency state for the
affected coherency unit consistent with performing a
write operation and then subsequently transferring the
data from the initiating processor, system interface 24
allows transfer of the data to system interface 24 prior
to completing the requisite coherency operation. In other
words, system interface 24 does not assert the ignore
signal 70 for write operations having an address in
LPAfw region 304 due to a lack of proper coherency state
to perform a write. The write operation to the LPAfw
address region may thereby appear to the issuing processor
16 to complete before the obtaining of the write permission
by SMP node 12 has been globally ordered.
Processor resources are freed more rapidly than if the
coherency state is acquired prior to receiving the data
from the processor.

[0127] Addresses within LPAfw region 304 are therefore
assigned the additional property that write operations
performed to LPAfw region 304 are performed using
a fast write protocol. Write operations using the fast
write protocol may be completed out of order with respect
to the other operations performed within the local
SMP node 12. It is noted that other combinations of the
MSBs within LPA address space 300 may be used to
assign other additional properties.
[0128] Generally speaking, a "fast write" write operation
may be completed out of order with respect to the
surrounding operations. Still further, the "fast write" write
operation is effectively completed outside of the global
ordering of computer system 10 since the operation is
completed in the local node prior to acquiring a coherency
state consistent with performing a write operation.
Therefore, the order generally applied to transactions
upon SMP bus 20 is overridden via the fast write protocol.
Although in the embodiment described certain bits
of the address of a "fast write" write operation form the
specific encoding identifying the "fast write" write operation,
other formats of the "fast write" write operation are
contemplated. For example, control signals upon address
bus 58 (shown in Fig. 2) identify the type of transaction
being presented upon address bus 58. Additional
encodings of the control signals may be defined to indicate
that a "fast write" write transaction is being performed
instead of using MSBs of the address presented.
Still further, instead of using a write stream instruction
to perform fast writes, a new instruction may be defined.
The new instruction expressly indicates that a "fast
write" write operation is to be performed. Processor 16
may be designed to perform the fast write instruction by
presenting a "fast write" write transaction upon address
bus 58.

[0129] Turning now to Fig. 15, a flow chart 310 depicting
processing of transactions received by system interface
24 is shown according to one embodiment of system
interface 24. When a transaction is detected, system
interface 24 determines if the transaction is a read
or write transaction (decision box 312). If a read transaction
is detected, then read processing is performed
by system interface 24 in accordance with Fig. 13 (step
314). Alternatively, when a write transaction is detected,
system interface 24 determines if a write stream transaction
having an address within LPAfw region 304 is conveyed
(decision box 316). In other words, system interface
24 determines if a write operation having a fast
write encoding is performed. If a non-fast write transaction
is detected, system interface 24 processes the write
operation as described with respect to Fig. 13 (step
318). If a write stream transaction to LPAfw region 304
is detected, steps 320, 322, and 324 are performed.
[0130] A fast write transaction may be performed in
either NUMA mode (when the "xx" field specifies an
SMP node 12A-12D other than the SMP node 12A-12D
in which the fast write transaction is generated) or in COMA
mode. As mentioned above, NUMA mode is selected
by coding the global address into MMUs 76 while COMA
mode is selected by coding a local physical address
into MMUs 76. Fast write transactions may be particularly
beneficial for NUMA mode, where no MTAG is
present in system interface 24. Because no MTAG is
present, the access rights of the node to the affected
coherency unit cannot be determined within the node.
Therefore, coherency activity is performed for a write
transaction in NUMA mode even if no other node is
maintaining a copy of the affected coherency unit. Fast
write transactions allow this coherency activity to occur
concurrent with transfer of the data from the initiating
processor, thereby freeing local node resources more
quickly than if the same NUMA write transaction were
performed using a non-fast write encoding.
[0131] As shown in step 320, system interface 24
queues the fast write operation within system interface
24. In one embodiment, the fast write operation is
queued in SMP in queue 94 as shown in Fig. 3. The
ignore signal 70 is not asserted upon address bus 58,
regardless of the state of the affected coherency unit
within MTAG 68. Conversely, a non-fast write operation
affecting a coherency unit for which MTAG 68 is storing
the invalid, shared, or owned state receives an asserted
ignore signal 70. After acquiring write access to the coherency
unit, system interface 24 reissues the non-fast
write operation and the operation may complete at that
time.

[0132] Since ignore signal 70 is not asserted upon the
fast write transaction, the corresponding data is subsequently
provided by processor 16 upon data bus 60
(shown in Fig. 2). During step 322, the data is received
and stored by system interface 24. The write operation
is thereby complete with respect to the initiating processor
16.

[0133] Step 324 indicates that coherency operations
are performed to process the write operation at the global
level. It is noted that step 324 may be initiated upon
receipt of the write operation. Therefore, steps 322 and
324 may be performed in parallel.
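
The decision flow of Fig. 15 and the handling of ignore signal 70 can be sketched as follows. This is an illustrative model only, not the disclosed circuitry. The class and field names are hypothetical, and an address absent from the modeled MTAG is assumed to be in the invalid state.

```python
from collections import deque
from dataclasses import dataclass, field

# States for which a non-fast write receives an asserted ignore signal,
# as listed in the description of step 320.
IGNORED_STATES = {"invalid", "shared", "owned"}


@dataclass
class Transaction:
    kind: str                  # "read" or "write_stream"
    address: int               # address of the affected coherency unit
    fast_write: bool = False   # True when the address lies in the LPAfw region


@dataclass
class SystemInterface:
    mtag: dict = field(default_factory=dict)             # address -> state
    smp_in_queue: deque = field(default_factory=deque)   # queued fast writes
    coherency_started: list = field(default_factory=list)

    def handle(self, txn: Transaction) -> str:
        if txn.kind == "read":                            # decision box 312
            return "read processing (step 314)"
        if txn.kind == "write_stream" and txn.fast_write:  # decision box 316
            self.smp_in_queue.append(txn)                 # step 320
            self.coherency_started.append(txn.address)    # step 324 may begin
            return "ignore not asserted; data accepted (step 322)"
        # Non-fast write (step 318): consult the MTAG state.
        state = self.mtag.get(txn.address, "invalid")     # assumed default
        if state in IGNORED_STATES:
            self.coherency_started.append(txn.address)
            return "ignore asserted; reissued after write access is acquired"
        return "write completes"
```

In this model a fast write is queued and accepted whatever the MTAG state, while a non-fast write to a unit in the invalid, shared, or owned state is ignored until write access has been acquired.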
[0134] Turning now to Fig. 16, a block diagram of a
portion of one embodiment of computer system 10 is
shown to further illustrate performance of write operations
using the fast write protocol in computer system
10. Fig. 16 depicts processors 16A and 16B, although
additional processors 16 may be included. Processors
16 include respective write stream buffers 330 (such as
write stream buffer 330A within processor 16A and write
stream buffer 330B within processor 16B). External
caches 18 are shown coupled between processors 16
and SMP bus 20. However, external caches 18 are bypassed
by write stream operations. Therefore external
caches 18 are shown as dashed elements. Additionally,
system interface 24 is shown coupled to SMP bus 20.
Within system interface 24, SMP in queue 94 and request
agent 100 are shown.

[0135] Write stream buffers 330 are included in processors
16 for storing write stream operations prior to
their completion upon SMP bus 20. The address to be 
written by the write stream operation may be stored, as 
well as the corresponding data. When the address has 
been presented upon SMP bus 20 and the correspond- 
ing data has been transferred, the write stream buffer 
330 is available for storing a subsequent write stream 
operation. Typically, processors 16 are configured to 
support a small number of outstanding write stream operations.
For example, one write stream buffer 330 may
be included in each processor 16. Therefore, if multiple 
write stream operations are to be performed within a rel- 
atively short period of time, processor 16 may stall in- 
struction execution until the write stream operations are 
stored into write stream buffers 330. 
[0136] Even in embodiments of computer system 10 
including address controller 52 and data controller 54, 
a similar problem exists. Storage locations within ad- 
dress controller 52 and data controller 54 are allocated 
to the write stream operation, and these storage loca- 
tions are not freed until the write stream operation is 
completed upon SMP bus 20. Additionally, if a write 
stream operation receives an asserted ignore signal 
from system interface 24 (i.e. it is not a fast write operation),
then subsequent transactions from that address control- 
ler are also ignored. Therefore, transactions of all types 
may be impeded by write stream operations which do 
not use the fast write protocol. 
[0137] System interface 24, on the other hand, in- 
cludes SMP in queue 94. SMP in queue 94 may be much 
larger than the buffers included within processors 16, 
storing a significantly larger number of transactions. In 
one embodiment, SMP in queue 94 includes 128 stor- 
age locations for transactions. Storage locations within 
output data queue 90 (shown in Fig. 2) correspond to 
storage locations within SMP in queue 94 and store the 
data corresponding to write operations within SMP in 
queue 94. Request agent 100 selects transactions from
SMP in queue 94 for which to perform coherency oper- 
ations, and transmits the coherency operations upon 
network 14. 

[0138] Due to the larger number of storage locations
within SMP in queue 94, a large number of fast write
stream operations may be queued therein. Since the
fast write stream transactions are completed from processors
16 by storing the transaction into SMP in queue
94 and the corresponding data within output data queue
90, processors 16 may continue with other operations
while system interface 24 completes the write stream
operations.

[0139] Turning next to Fig. 17, a diagram depicting coherency
activities performed in response to a fast write
stream operation is shown according to one embodiment
of computer system 10. A request agent 100, a
home agent 102, an owner slave agent 104A, and
a sharing slave agent 104B are shown in Fig. 17. Request
agent 100, upon receipt of a write stream transaction
having an LPAfw address, transmits a write
stream request to the home node identified by the GA
translated from the LPAfw address (reference number
340). Alternatively, the write stream operation may be
presented upon SMP bus 20 using a global address
identifying fast write protocol via the most significant
bits. In one embodiment, the write stream request is
conveyed regardless of the coherency state stored in
MTAG 68 within the requesting node.
[0140] Upon receipt of the write stream request from 
request agent 100, a home agent 102 determines the 
owner and any sharers of the requested coherency unit. 
The home agent 102 transmits an invalidate demand to
the owner slave 104A and to the sharing slave(s) 104B
(reference numbers 342 and 344, respectively). In this 
manner, copies of the coherency unit updated by the 
write stream operation within any slave nodes are inval- 
idated. The write stream operation updates each byte 
within the coherency unit. Therefore, the copies main- 
tained by slaves 104 are invalid upon completion of the 
write stream coherency operation. 
[0141] Slave agents 104 receive the invalidate demands,
and transmit acknowledge replies to request
agent 100 (reference numbers 346 and 348). Addition- 
ally, the slave agents 104 invalidate their copies of the 
coherency unit. 

[0142] Upon receipt of the acknowledge replies from 
each of the slave agents 104, request agent 100 transmits
a coherency completion with data to home agent
102 (reference number 350). The data transmitted is the
data received from the processor 16 which initiated the
fast write stream transaction. It is noted that, if a copy 
of the coherency unit updated by the fast write stream 
transaction is stored in the memory 22 corresponding to 
the SMP node 12 including the initiating processor 16, 
the copy is invalidated (similar to any other slave copy). 
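
The message exchange of Fig. 17 may be summarized by the following sketch, which builds the ordered list of coherency messages for one fast write stream operation. It is an illustration under simplifying assumptions: messages are modeled as tuples, every slave replies, and the function name and the 64-byte payload in the usage note are hypothetical.

```python
def fast_write_stream_coherency(requester, home, slaves, data):
    """Return (sender, receiver, message, payload) tuples in protocol order."""
    trace = [(requester, home, "write stream request", None)]      # 340
    for slave in slaves:
        trace.append((home, slave, "invalidate demand", None))     # 342, 344
    acknowledged = 0
    for slave in slaves:
        # Each slave invalidates its copy and replies to the request agent.
        trace.append((slave, requester, "acknowledge reply", None))  # 346, 348
        acknowledged += 1
    # The completion carrying the processor's data is sent only after
    # every acknowledge reply has arrived.
    if acknowledged == len(slaves):
        trace.append((requester, home, "coherency completion", data))  # 350
    return trace
```

For example, calling the function with an owner slave and one sharing slave and a 64-byte payload yields six messages: one request, two invalidate demands, two acknowledge replies, and one coherency completion carrying the data.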
[0143] Turning next to Fig. 18, a timing diagram is 
shown depicting transactions performed upon SMP bus 
20 to perform a write stream operation in one embodiment
of computer system 10. Address bus 58 transactions
are shown, as well as data bus 60 transactions.
[0144] Upon execution of a write stream instruction,
a processor 16 performs a write stream transaction upon
address bus 58 (reference number 360). System interface
24 examines the coherency state of the affected
coherency unit (i.e. the coherency unit including address
"A") within MTAG 68. If the SMP node 12 has write
permission to the coherency unit (e.g. the modified
state), system interface 24 allows the write stream operation
to complete. However, if write permission is not
stored in MTAG 68, system interface 24 asserts the ignore
signal as shown in Fig. 18 (reference number 362).
System interface 24 proceeds with coherency operations
to acquire write permission to the affected coherency
unit. A significant amount of time may elapse between
the ignoring of write stream transaction 360 and
a subsequent reissue of the write stream transaction
(reference number 364). System interface 24 reissues
the write stream transaction upon acquiring write permission
to the affected coherency unit. Upon detection
of the reissue, processor 16 conveys the data corresponding
to write stream transaction 360 (reference
number 366) in accordance with the bus protocol of
SMP bus 20. Once the data is transferred, the processor
16 resources employed to store and perform the write
stream transaction are freed for use by another transaction.
A processor 16 supporting only one outstanding
write stream transaction may now initiate a second write
stream operation to an address B (reference number
368).

[0145] Conversely, Fig. 19 shows a timing diagram of
a fast write stream operation as performed by one embodiment
of computer system 10. Address bus 58 transactions
are shown, as well as data bus 60 transactions.
[0146] Similar to Fig. 18, a processor 16 performs a 
write stream transaction 370 upon address bus 58 upon 
execution of a write stream instruction. However, the 
write stream transaction in Fig. 19 is performed using
the fast write stream encoding. Regardless of the state 
of the updated coherency unit in MTAG 68, system in- 
terface 24 does not assert the ignore signal 70 (refer- 
ence number 372). Subsequently, the data correspond- 
ing to the fast write stream transaction 370 is transferred 
upon data bus 60. The processor 16 resources used to 
store and perform the fast write stream transaction are 
freed rapidly, allowing the resources to be used for subsequent
transactions such as another write stream operation
(reference number 376). Advantageously, the
protocol and traffic upon SMP bus 20 determine the
time period for which processor resources are occupied
by the fast write stream transaction. Conversely, write
stream transactions as shown in Fig. 18 occupy processor
resources for a time period determined by the latency
of the corresponding coherency operations performed
upon network 14.
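
The contrast between Fig. 18 and Fig. 19 can be expressed as a simple occupancy model. The sketch below is a rough illustration and not a measured result: it assumes a single write stream buffer per processor, no other bus contention, and arbitrary time units; the function names and the figures in the usage note are hypothetical.

```python
def buffer_occupancy(bus_transfer_time, coherency_latency, fast_write):
    """Time one write stream operation occupies the processor's buffer."""
    if fast_write:
        # Fig. 19: the data is transferred as soon as the bus allows.
        return bus_transfer_time
    # Fig. 18: the transaction is ignored, coherency operations complete
    # on the network, and only then is the data transferred.
    return coherency_latency + bus_transfer_time


def time_to_hand_off(n_writes, bus_transfer_time, coherency_latency, fast_write):
    """Time until n back-to-back write streams have left the processor,
    assuming one write stream buffer."""
    return n_writes * buffer_occupancy(
        bus_transfer_time, coherency_latency, fast_write)
```

With an assumed bus transfer time of 2 units and an assumed coherency latency of 50 units, four consecutive write stream operations leave the processor after 8 units using the fast write encoding and after 208 units otherwise.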

[0147] Although SMP nodes 12 have been described
in the above exemplary embodiments, generally speaking
an embodiment of computer system 10 may include
one or more processing nodes. As used herein, a
processing node includes at least one processor and a
corresponding memory. Additionally, circuitry for communicating
with other processing nodes is included.
When more than one processing node is included in an
embodiment of computer system 10, the corresponding
memories within the processing nodes form a distributed
shared memory. A processing node may be referred
to as remote or local. A processing node is a remote
processing node with respect to a particular processor
if the processing node does not include the particular
processor. Conversely, the processing node which includes
the particular processor is that particular processor's
local processing node. Still further, the term "coherency
operation", as used herein, refers to a combination
of coherency requests, coherency demands, coherency
replies, and coherency completions employed
to acquire a particular coherency state in the processing
node within which a transaction is initiated which causes
the coherency state to be desired in the processing
node.

[0148] In accordance with the above disclosure, a 
computer system has been described which performs 
efficient write operations. Processor resources are freed
upon transmission of the write operation and corre- 
sponding data to the system interface, before an appro- 
priate coherency state is acquired by the node contain- 
ing the processor. The ordering of transactions within 
the node is not maintained for the write operations, but 
the operations are cleared from the processor more rap- 
idly. Advantageously, the processor resources are avail- 
able for use by subsequent transactions while coheren- 
cy operations are performed in response to the write 
transactions. Ordinarily, these processor resources 
would be occupied by the write transaction. As a result, 
computer system performance may be increased to the 
extent that the more rapidly freed resources may be 
used for subsequent transactions during performance 
of the corresponding coherency operations. 



Claims 

1. A method for performing coherent write operations
in a distributed shared memory multiprocessor 
computer system that includes a plurality of 
processing nodes interconnected by a network, 
each processing node including at least one proc- 
essor, memory and a system interface interfacing 
said processing node to said network and providing 
internode coherency, the method comprising: 

a processor within a local processing node of 
said multiprocessing computer system initiat- 
ing a write operation (312) to a coherency unit, 
said coherency unit having a home processing 
node; 

said system interface of said local processing 
node performing a coherency operation (324) 
to said home processing node in response to 
said write operation for invalidating any slave 
copies of said coherency unit; 
said system interface permitting transferring of 
data (322) corresponding to said write opera- 
tion from said processor to said system inter- 
face prior to receiving confirmation that said 
slave copies have been invalidated if said write 
operation includes a specific predefined encod- 
ing (316) indicative that said write operation is 
a fast write operation, said fast write operation
being to an entire coherency unit, and on receiving
said confirmation, transmitting coherency
completion to said home processing node, and

said system interface inhibiting transferring of
said data from said processor until said confirmation
has been received if said write operation
includes a different encoding than said specific
predefined encoding (318).

2. The method as recited in claim 1 wherein said spe- 
cific predefined encoding is provided via an address 
included with said write operation. 

3. The method as recited in claim 2 wherein said address
lies within a first address region which is within
an address space of said local processing node.

4. The method as recited in claim 3 wherein said first
address region is identified by a particular value
within a plurality of most significant bits of said address.

5. The method as recited in claim 3 wherein said first 
address region is an alias for a second address region
within said address space.

6. The method as recited in claim 4 wherein said encoding
different than said specific predefined encoding
comprises a second address within a second
address region.

7. The method as recited in any of claims 3 to 6 where- 
in said write operation is a write stream operation. 


8. The method as recited in any of claims 3 to 7 further 
comprising translating said address to a global ad- 
dress prior to said performing said coherency oper- 
ation. 


9. The method as recited in any preceding claim com- 
prising obtaining a coherency state which grants 
write permission. 

10. The method as recited in claim 9 further comprising
transferring said data to a home node of said ad- 
dress upon obtaining said write permission. 

11. A processing node for a distributed shared memory
multiprocessor computer system that includes a
plurality of such processing nodes interconnected 
by a network, said processing node comprising at 
least one processor, memory and a system interface
interfacing said processing node to said network
and providing internode coherency, wherein:

a said processor (16A) in a local processing 
node is configured to initiate a write operation
to a coherency unit,

said coherency unit having a home processing 
node; 

said system interface (24) in said local process- 
ing node is coupled to receive said write oper- 
ation and to perform a coherency operation to 
said home processing node in response to said 
write operation for invalidating any slave copies 
of said coherency unit, 
said system interface being further operable 

to permit said processor to transfer data 
corresponding to said write operation from said 
processor to said system interface prior to receiving
confirmation that said slave copies
have been invalidated if said write operation includes
a specific predefined encoding indicative
that said write operation is a fast write operation,
said fast write operation being to an entire
coherency unit, and on receiving said confirmation,
to transmit said coherency completion
to said home processing node, and

to inhibit said processor from transferring 
said data until said confirmation is received if 
said write operation includes a different encod- 
ing than said specific predefined encoding. 

12. The apparatus as recited in claim 11 comprising obtaining
a coherency state which grants write permission
to said coherency unit identified by said write
operation. 

13. The apparatus as recited in claim 11 wherein said 
specific predefined encoding is provided via an ad- 
dress included with said write operation. 

14. The apparatus as recited in claim 13 wherein said 
address lies within a first address region within an 
address space accessible to said processor.

15. The apparatus as recited in claim 14 wherein said 
first address region is an alias to a second address 
region within said address space, and wherein said 
different encoding comprises a second address ly- 
ing within said second address region. 

16. A computer system, comprising: 

a first processing node including apparatus ac- 
cording to any one of claims 11 to 15; and 
a second processing node comprising a sec- 
ond processor and a second system interface, 
wherein the second processing node is config- 
ured as said home node of said coherency unit, 
and wherein said second processing node is 
coupled to receive a coherency request corre- 
sponding to said coherency operation from said 
first processing node. 



17. The computer system as recited in claim 16 wherein
said predefined encoding comprises an address 
within an address region of an address space cor- 
responding to said first processing node. 


18. The computer system as recited in claim 17 wherein
said address region is an alias to a second address 
region within said address space. 

19. The computer system as recited in any of claims 16
to 18 wherein said first processing node provides 
data corresponding to said write operation to said 
second processing node upon completion of said 
coherency request. 


Patentansprüche

1. Verfahren zum Durchführen kohärenter Schreibvorgänge
in einem Mehrprozessorcomputersystem mit
verteiltem, gemeinsam verwendetem Speicher, wobei
das Computersystem eine Mehrzahl von Prozessorknoten
aufweist, die durch ein Netzwerk miteinander
verbunden sind, jeder Prozessorknoten
zumindest einen Prozessor, einen Speicher und eine
Systemschnittstelle aufweist, die den Prozessorknoten
mit dem Netzwerk verbindet und Kohärenz
zwischen den Knoten bereitstellt,
wobei das Verfahren aufweist:

einen Prozessor innerhalb eines lokalen Prozessorknotens
des Mehrprozessorcomputersystems,
welcher einen Schreibvorgang (312)
an eine Kohärenzeinheit auslöst, wobei die Kohärenzeinheit
einen Heimat-Prozessorknoten
hat, wobei

die Systemschnittstelle des lokalen Prozessorknotens
einen Kohärenzvorgang (324) mit dem
Heimat-Prozessorknoten in Reaktion auf den
Schreibvorgang ausführt, um irgendwelche abhängigen
Kopien (Slave-Kopien) der Kohärenzeinheit
ungültig zu machen,
die Systemschnittstelle das Übertragen von
Daten (322), welche dem Schreibvorgang entsprechen,
von dem Prozessor zu der Systemschnittstelle
vor der Bestätigung erlaubt, daß
die abhängigen Kopien ungültig gemacht worden
sind, wenn der Schreibvorgang eine spezielle,
vorbestimmte Codierung (316) aufweist,
die anzeigt, daß der Schreibvorgang ein
schneller Schreibvorgang ist, wobei der schnelle
Schreibvorgang auf eine gesamte Kohärenzeinheit
gerichtet ist, und wobei nach dem
Empfang der Bestätigung der Abschluß der Kohärenz
an den Heimat-Prozessorknoten übermittelt
wird, und wobei

die Systemschnittstelle das Übertragen der Daten
von dem Prozessor verhindert, bis die Bestätigung
empfangen worden ist, wenn der
Schreibvorgang eine andere Codierung als die
spezielle, vorbestimmte Codierung (318) aufweist.

2. Verfahren nach Anspruch 1, wobei die spezielle,
vorbestimmte Codierung über eine Adresse bereitgestellt
wird, die in dem Schreibvorgang enthalten
ist.

3. Verfahren nach Anspruch 2, wobei die Adresse innerhalb
eines ersten Adreßbereiches liegt, der innerhalb
eines Adreßraumes und des lokalen Prozessorknotens
liegt.

4. Verfahren nach Anspruch 3, wobei der erste
Adreßbereich durch einen bestimmten Wert innerhalb
einer Mehrzahl der höchstwertigen Bits der
Adresse identifiziert wird.

5. Verfahren nach Anspruch 3, wobei der erste
Adreßbereich ein Alias- bzw. Deckname-Bereich
für einen zweiten Adreßbereich innerhalb des
Adreßraumes ist.

6. Verfahren nach Anspruch 4, wobei die Codierung,
welche sich von der speziellen, vordefinierten Codierung
unterscheidet, eine zweite Adresse innerhalb
eines zweiten Adreßbereiches aufweist.

7. Verfahren nach einem der Ansprüche 3 bis 6, wobei
der Schreibvorgang ein Datenstromschreibvorgang
ist.

8. Verfahren nach einem der Ansprüche 3 bis 7, welches
weiterhin das Übersetzen der Adresse in eine
globale Adresse vor dem Durchführen des Kohärenzvorganges
aufweist.

9. Verfahren nach einem der vorstehenden Ansprüche,
welches das Erhalten eines Kohärenzzustandes
aufweist, der eine Schreiberlaubnis gewährt.

10. Verfahren nach Anspruch 9, das weiterhin das
Übertragen der Daten zu einem Heimatknoten dieser
Adresse nach Erhalt der Schreiberlaubnis aufweist.

11. Verarbeitungsknoten bzw. Prozessorknoten für ein
Mehrprozessorcomputersystem mit verteiltem, gemeinsam
verwendetem Speicher, welches eine
Mehrzahl derartiger Prozessorknoten enthält, die
durch ein Netzwerk miteinander verbunden sind,
wobei der Prozessorknoten zumindest einen Prozessor,
einen Speicher und eine Systemschnittstelle
aufweist, welche eine Schnittstelle zwischen dem
Prozessorknoten und dem Netzwerk bildet und welche
eine Kohärenz zwischen den Knoten bereitstellt,
wobei:

ein solcher Prozessor (16A) in einem lokalen
Prozessorknoten so ausgestaltet ist, daß er einen
Schreibvorgang an eine Kohärenzeinheit
auslöst, wobei die Kohärenzeinheit einen Heimat-Prozessorknoten
hat,
die Systemschnittstelle (24) in dem lokalen
Prozessorknoten so angeschlossen ist, daß sie
den Schreibvorgang empfängt und eine Kohärenzbearbeitung
an dem Heimat-Prozessorknoten
ausführt in Reaktion auf den Schreibvorgang,
um irgendwelche abhängigen Kopien
(Slave-Kopien) der Kohärenzeinheit ungültig
zu machen,

wobei die Systemschnittstelle weiterhin so
betreibbar ist, daß sie

dem Prozessor erlaubt, Daten entsprechend
dem Schreibvorgang von dem Prozessor an die Systemschnittstelle
zu übertragen, bevor er die Bestätigung
empfangen hat, daß die abhängigen Kopien
ungültig gemacht worden sind, wenn der Schreibvorgang
einen speziellen, vorbestimmten Code
aufweist, welcher anzeigt, daß der Schreibvorgang
ein schneller Schreibvorgang ist, wobei der schnelle
Schreibvorgang auf eine gesamte Kohärenzeinheit
gerichtet ist, und bei Empfang der Bestätigung,
den Abschluß der Kohärenz an den Heimatprozessorknoten
zu übermitteln, und

um den Prozessor daran zu hindern, die Daten
zu übertragen, bis die Bestätigung empfangen
worden ist, wenn der Schreibvorgang eine andere
Codierung aufweist als die spezielle, vorbestimmte
Codierung.

1 2. Vorrichtung nach Anspruch 1 1 , welche das Erhalten 
eines Koharenzzustandes aufweist, der der Koha- 
renzeinheit, welche durch den Schreibvorgang 

40 identifiziert wird, eine Schreiberlaubnis erteilt. 

13. Apparatus according to claim 11, wherein the specific predefined encoding is provided via an address included in the write operation.

14. Apparatus according to claim 13, wherein the address lies within a first address region within an address space accessible to the processor.

15. Apparatus according to claim 14, wherein the first address region is an alias for a second address region within the address space, and wherein the different encoding comprises a second address lying within the second address region.

16. A computer system comprising:

a first processing node including an apparatus according to any one of claims 11 to 15; and

a second processing node having a second processor and a second system interface, wherein the second processing node is configured as said home node of the coherency unit, and wherein the second processing node is coupled to receive from the first processing node a coherency request corresponding to the coherency operation.

17. Computer system according to claim 16, wherein the predefined encoding comprises an address within an address region of an address space corresponding to the first processing node.

18. Computer system according to claim 17, wherein the address region is an alias for a second address region within the address space.

19. Computer system according to any one of claims 16 to 18, wherein the first processing node provides data corresponding to the write operation to the second processing node after the coherency request has been completed.
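
The ordering that the apparatus claims impose on the system interface can be illustrated with a small sketch. It is only a toy model under stated assumptions: the class name `SystemInterface`, the labels `FAST_WRITE` and `NORMAL_WRITE`, and the event strings are hypothetical and do not appear in the specification. It shows that, for the fast write encoding, data moves from the processor to the interface before the invalidation is confirmed, whereas for any other encoding the data transfer waits for the confirmation.

```python
from dataclasses import dataclass, field

FAST_WRITE = "fast"      # hypothetical label for the specific predefined encoding
NORMAL_WRITE = "normal"  # hypothetical label for any other encoding


@dataclass
class SystemInterface:
    """Toy model of a local node's system interface; it only records event order."""
    log: list = field(default_factory=list)

    def handle_write(self, encoding: str) -> list:
        # In both cases a coherency operation is performed at the home node
        # to invalidate slave copies of the coherency unit.
        self.log.append("coherency request sent to home node")
        if encoding == FAST_WRITE:
            # Fast write: the data transfer is allowed before confirmation,
            # and completion is reported once the confirmation arrives.
            self.log.append("data accepted from processor")
            self.log.append("invalidation confirmed")
            self.log.append("coherency completion sent to home node")
        else:
            # Any other encoding: the processor is held off until confirmation.
            self.log.append("invalidation confirmed")
            self.log.append("data accepted from processor")
        return self.log
```

A real interface would of course react to asynchronous network messages; the sketch linearizes them only to make the two permitted orderings visible.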



Claims (French version)

1. A method for performing coherent write operations in a distributed shared memory multiprocessing computer system comprising a plurality of processing nodes interconnected by a network, each processing node comprising at least one processor, a memory and a system interface interfacing said processing node to said network and providing inter-node coherency, the method comprising the steps of:

using a processor in a local processing node of said multiprocessing computer system, initiating a write operation (312) to a coherency unit, said coherency unit having a home processing node;

using said system interface of said local processing node, performing a coherency operation (324) at said home processing node in response to said write operation to invalidate any slave copies of said coherency unit;

using said system interface, allowing the transfer of data (322) corresponding to said write operation from said processor to said system interface before receiving confirmation that said slave copies have been invalidated, if said write operation includes a specific predefined encoding (316) indicating that said write operation is a fast write operation, said fast write operation being directed to an entire coherency unit, and, upon receipt of said confirmation, transmitting coherency completion to said home processing node; and

using said system interface, inhibiting the transfer of said data from said processor until said confirmation has been received, if said write operation includes an encoding different from said specific predefined encoding (318).

2. Method according to claim 1, wherein said specific predefined encoding is provided via an address accompanying said write operation.

3. Method according to claim 2, wherein said address lies within a first address region, which lies within an address space of said local processing node.

4. Method according to claim 3, wherein said first address region is identified by a particular value within a plurality of high-order bits of said address.

5. Method according to claim 3, wherein said first address region is an alias for a second address region within said address space.

6. Method according to claim 4, wherein said encoding different from said specific predefined encoding comprises a second address within a second address region.

7. Method according to any one of claims 3 to 6, wherein said write operation is a write stream operation.

8. Method according to any one of claims 3 to 7, further comprising the step of translating said address into a global address prior to said step of performing said coherency operation.

9. Method according to any one of the preceding claims, comprising the step of obtaining a coherency state which grants write permission.

10. Method according to claim 9, further comprising the step of transferring said data to a home node of said address upon obtaining said write permission.

11. A processing node for a distributed shared memory multiprocessing computer system that comprises a plurality of said processing nodes interconnected by a network, said processing node comprising at least one processor, a memory and a system interface interfacing said processing node to said network and providing inter-node coherency, wherein:

said processor (16A) in a local processing node is configured to initiate a write operation to a coherency unit, said coherency unit having a home processing node;

said system interface (24) in said local processing node is coupled to receive said write operation and to perform a coherency operation at said home processing node in response to said write operation to invalidate any slave copies of said coherency unit,

said system interface being further operable

to allow said processor to transfer data corresponding to said write operation from said processor to said system interface before receiving confirmation that said slave copies have been invalidated, if said write operation includes a specific predefined encoding indicating that said write operation is a fast write operation, said fast write operation being directed to an entire coherency unit, and, upon receipt of said confirmation, to transmit said coherency completion to said home processing node, and

to inhibit said processor from transferring said data until said confirmation is received, if said write operation includes an encoding different from said specific predefined encoding.

12. Apparatus according to claim 11, comprising obtaining a coherency state which grants write permission to said coherency unit identified by said write operation.

13. Apparatus according to claim 11, wherein said specific predefined encoding is provided via an address accompanying said write operation.

14. Apparatus according to claim 13, wherein said address lies within a first address region within an address space accessible to said processor.

15. Apparatus according to claim 14, wherein said first address region is an alias for a second address region within said address space, and wherein said different encoding comprises a second address lying within said second address region.

16. A computer system comprising:

a first processing node comprising the apparatus according to any one of claims 11 to 15; and

a second processing node comprising a second processor and a second system interface, wherein the second processing node is configured as said home node of said coherency unit, and wherein said second processing node is coupled to receive a coherency request corresponding to said coherency operation from said first processing node.

17. Computer system according to claim 16, wherein said predefined encoding comprises an address within an address region of an address space corresponding to said first processing node.

18. Computer system according to claim 17, wherein said address region is an alias for a second address region within said address space.

19. Computer system according to any one of claims 16 to 18, wherein said first processing node provides data corresponding to said write operation to said second processing node after completion of said coherency request.
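
The address-based encoding recited in the claims (a first address region selected by high-order address bits and acting as an alias for a second region) can be sketched as follows. This is a minimal illustration, not the patented implementation: the address width, the number of region bits, the region values and the function name `decode` are all assumptions chosen for the example.

```python
ADDRESS_BITS = 32         # assumed address width
REGION_BITS = 2           # assumed number of high-order bits selecting a region
FAST_WRITE_REGION = 0b01  # assumed value marking the alias (fast write) region
NORMAL_REGION = 0b00      # assumed value of the second, aliased region

OFFSET_BITS = ADDRESS_BITS - REGION_BITS
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def decode(address):
    """Return (is_fast_write, aliased_address) for a write address."""
    region = address >> OFFSET_BITS   # high-order bits name the region
    offset = address & OFFSET_MASK    # low-order bits name the location
    is_fast_write = region == FAST_WRITE_REGION
    # Both regions refer to the same storage, so the offset is mapped
    # back into the second region regardless of which alias was used.
    aliased_address = (NORMAL_REGION << OFFSET_BITS) | offset
    return is_fast_write, aliased_address
```

With these assumed constants, a write to `0x40001000` decodes as a fast write to the location that `0x00001000` names in the second region, while a write to `0x00001000` itself decodes as an ordinary write to the same location.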
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